Preface


Welcome to the USENIX Summer 1994 Conference! 


Thanks to the bush-beating efforts of the Program Committee, we received 
more than a hundred paper submissions. We hope that you find the results both 
interesting and educational. 


This conference has certainly been “interesting and educational” for us! It
would not have been possible without a Program Committee that went far beyond 
the call of duty, and the help of the many outside reviewers. Evi Nemeth was a 
great help as Board Liaison. Brent Welch and Bob Gray designed the Invited 
Talks track, and Bob served on the Program Committee as well. Peter Salus 
orchestrated the historical events, Dan Klein put the tutorial track together, Peg 
Shafer coordinated the Works-in-Progress session and Ed Gould arranged the 
“Guru is in” sessions. And, of course, the staff at Usenix (who have to put up
with the whims of new program chairs every six months) continue to do a fabulous
job. Carolyn Carr put the proceedings together, Cynthia Deno organized and
produced the Call for Papers and Programs, Judy DesHarnais handled all of the 
conference logistics, Toni Veglia managed the rest of the details, and Ellie Young 
ran the show. If you see these people at the conference, please offer them your 
thanks; they make the Usenix conferences happen. 


We wish you an exciting 25th Anniversary and a great conference. 
A New Object-Oriented Programming Language: sh


Jeffrey S. Haemer 
Canary Software, Inc. 


Abstract 


Many have frittered away their time on C++, 
while overlooking the new, POSIX.2-required, object- 
oriented language: sh. As will be clear from the 
enclosed code, the name may allude to the fact that 
the author would be embarrassed to have anyone find 
out about it. 


This paper introduces a tiny, object-oriented 
programming system written entirely in POSIX- 
conforming shell scripts. 


1. Overview 


Object-oriented programming is currently all the rage 
[King89]. Though we normally use languages 
designed specifically for the task, they aren’t always 
necessary. Here, we illustrate this point by doing 
object-oriented programming in the shell. 


In what follows, object classes are shell scripts 
and objects are running processes. Methods are 
invoked by messages passed to objects through 
FIFOs (named pipes). The methods themselves are 
implemented as shell functions; function polymor- 
phism is guaranteed because separate programs have 
separate name spaces. A class hierarchy is provided 
by the file system itself. 


Sensible default actions are taken by objects 
when they’re sent messages for which they lack 
explicitly defined methods. Debugging code can be 
added to objects on the fly, that is, after they’ve been
created. 


While the system is unconventional, only a toy, 
and downright slow, its implementation is straightfor- 
ward and its use instructive. For example, figure 1 
shows an implementation of the examples used in 
Roger Sessions’ Summer, ’93 USENIX Invited Talk 
[Sessions93]. 
Sessions’ talk used application code size as one
measure of the advantages of object-oriented pro- 
gramming. By that measure, and with the same 
examples, this system is better than C++. In fact, the 
core of the entire system, the two shell scripts create 
and send, total a little over 100 lines of code. If you 
don’t find your favorite OOP feature, it may not be 
very hard to add it. 


$ cat animalia
new animal pooh bugs

send pooh setName Pooh Bear
send pooh setFood Hunny

send bugs setName Bugs Bunny
send bugs setFood Carrots

for i in pooh bugs
do
    send $i getName
    send $i getFood
done
echo

new dog Snoopy
new littleDog Toto
new bigDog Lassie

for i in Snoopy Toto Lassie
do
    echo $i says
    send $i bark
    echo
done

destroy bugs pooh
destroy Snoopy Toto Lassie





$ ./animalia
My name is: Pooh Bear
My favorite food is: Hunny
My name is: Bugs Bunny
My favorite food is: Carrots

Snoopy says
Unknown Dog Noise

Toto says
woof woof
woof woof

Lassie says
WOOF WOOF
WOOF WOOF
WOOF WOOF
WOOF WOOF
WOOF WOOF


Figure 1. Animalia 


2. Design 


As a foundation, we begin by reviewing the three 
basic object-oriented features: encapsulation, func- 
tion polymorphism, and inheritance. These require- 
ments, plus sloth — an eagerness to let UNIX and the 
shell do as much of the job as possible — lead 
directly to most major design decisions. 


Encapsulation 


All object-oriented systems provide data 
abstraction and encapsulation; they let pro- 
grammers create and operate on objects that 
have user-defined types, while hiding all 
knowledge about how those operations and 
types are implemented. Programs are pre- 
vented from manipulating an object’s internal 
data structures except through the methods that 
operate on those objects. 


UNIX processes are attractive candidates for 
objects. Each running UNIX process has its
own name and address space; it’s impossible to 
look at or tweak the insides of an already run- 
ning process unless the process is running 
under a debugger. 
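The isolation that makes a process a plausible object can be seen with nothing more than a subshell. The following is only a minimal sketch, and the variable names `secret` and `inner` are illustrative rather than part of the system: a child process gets its own copy of the parent's variables, and nothing it does to them is visible outside.

```shell
# A command substitution runs in a separate subshell environment
# with its own copy of the variables.
secret=outer
inner=$(secret=inner; echo "$secret")   # the change happens in the child only
echo "child saw:  $inner"
echo "parent has: $secret"
```

The parent can learn what the child did only through what the child chooses to write on its output, which is the property the object system relies on.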


Objects and the object-class hierarchy 


Object-oriented programming systems try to
maximize code reuse by letting new data types
inherit methods and the data structures they
operate on from their “parent” classes. This
sort of inheritance produces a tree-structured
class hierarchy. Implementing class hierarchies
means picking a way to define trees.


UNIX has two ubiquitous tree structures that 
seem plausible choices, either of which might 
be worth barking up: the file system and the 
process tree. Deciding to make object classes 
conceptually distinct from objects, and making 
the latter simply instances of object classes, 
guides our choice. In this system, executing 
processes are objects; programs are their defi- 
nitions, the object classes. The file system 
implements the class hierarchy. Although a 
process can be any executing program image, 
in this system all objects will be shell scripts. 


Polymorphism 


A good object-oriented system lets the pro- 
grammer extend the suite of object classes 
without changing existing software. Programs 
can operate on new kinds of data that get 
defined long after the code is written. The 
same operation may be implemented different 
ways for different kinds of data, but the invoca- 
tion of the operation is identical. What sepa- 
rate compilation provides for functions, object- 
oriented programming provides for data. 


For example, if each new object responds to 
pass(), there is no need to change base code 
from 


    pass(X)

to

    if (object_type(X) == BILL)
        pass_bill(X);
    else if (object_type(X) == BUCK)
        pass_buck(X);
    else if (object_type(X) == GAS)


C++ and UNIX device drivers implement this 
sort of behavior by using tables of function 
pointers. read() is always read(), whatever the 
device, because the kernel figures out what 
routine to call based on the device that’s being 
read from. 


Many other object-oriented systems, like
Smalltalk, implement polymorphism by passing
messages between objects, which interpret
the messages at runtime. Thus, the instruction

    X pass


works whether X is a bill or a buck, because it 
just sends a message. Both bill and buck
understand the message “pass,” but each 
implements the operation with a data-type- 
specific method. 


Here, we take this latter approach. Class defi- 
nitions are shell scripts. Methods are shell 
functions. Messages are just the names of the 
functions, sent as ASCII strings. Two different 
scripts are free to use identical names for com- 
pletely different functions. 
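The idea can be tried in a single script before any processes or FIFOs enter the picture. In this sketch, which is an illustration and not code from the system, the functions `bill` and `buck` stand in for two class scripts; each runs its body in a subshell, so each has a private definition of `pass`, and the "message" is just a string handed to eval.

```shell
# Two stand-in "classes", each with its own private definition of pass().
# A ( ... ) function body runs in a subshell, so the two definitions
# never collide, much as separate scripts have separate name spaces.
bill() (
    pass() { echo "passing the bill"; }
    eval "$*"          # the message is just a command string
)
buck() (
    pass() { echo "passing the buck"; }
    eval "$*"
)

for obj in bill buck
do
    $obj pass          # identical invocation, type-specific method
done
```

The loop never asks what kind of thing it is talking to; each receiver resolves the name `pass` for itself.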


One side-effect of this approach is that mes- 
sages sent to objects that are not names of 
functions are interpreted as other sorts of exe- 
cutable statements: built-ins and shell com- 
mands. On first blush, this seems like a horri- 
ble bug; in practice, and to my surprise, it feels 
like a feature. 


3. Implementation 
Each object is a simple, infinite loop: 


forever 
read message from FIFO 
execute it as a command 


The FIFOs are created in a pre-arranged spot in the 
file system and have names tied to the names of their 
corresponding objects. Messages are sent with the 
program send. The command 


$ send pooh setFood Hunny 


just writes the message setFood Hunny to pooh’s 
input channel /tmp/ipc/pooh/in. 
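The loop and the sender can be tried together in a few lines. This is a stripped-down sketch under stated assumptions: it uses a scratch directory from mktemp instead of /tmp/ipc, the helper name `ask` is hypothetical, and it leaves out the message-type prefix and error checks that the real send and create carry.

```shell
# A one-object demonstration: serve messages arriving on a FIFO.
dir=$(mktemp -d)
mkfifo "$dir/in" "$dir/out"

# The object: read a message, run it as a command, send back the output.
(
    while read -r msg < "$dir/in"
    do
        eval "$msg" > "$dir/out"
    done
) &
obj_pid=$!

# The sender: write the message, then read the reply.
ask() {
    echo "$*" > "$dir/in"
    cat "$dir/out"
}

reply=$(ask echo hello)
echo "$reply"

# Tidy up: stop the object and remove its channels.
kill "$obj_pid" 2>/dev/null || true
wait "$obj_pid" 2>/dev/null || true
rm -rf "$dir"
```

Each open of a FIFO blocks until the other end is opened too, which is what keeps the sender and the object in step without any explicit locking.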


Object creation is trickier, but not by much. 
Each object class is a shell script, stored in a direc- 
tory tree where the directory and subdirectory names 
are class and subclass names. Each directory con- 
tains one script, named class, that defines the 
methods for the class corresponding to that directory. 
A request to create an object of type name starts up a 
process like this: 


use find to find directory name 


source all the class methods 
from the root of the class tree 
down to that definition 


create a FIFO tagged to 
the name of the object 


loop forever, reading and executing messages 
(as shown above). 
Taken together, send and create are currently
only a little over 100 lines of shell code. (The
new and destroy commands, used in the example 
in the first section, are just loops that call create 
and send exit for each of their arguments.) 


A handful of interesting things fall out of 
implementing objects this way. 


- Process manipulation commands can be used 
to handle objects. You can search for objects 
with ps and destroy them with kill -9. 


- Every object understands normal shell com- 
mands. Not only can you see if an object is 
alive, but you can see if it’s paying attention 
with commands like this: 


$ send pooh date 
Sun Jan 23 21:00:25 MST 1994 


- You can kill objects with exit. This is enor- 
mously comforting. (Since I haven’t gone to 
great lengths to make this system any more 
bullet-proof than it deserves to be, it’s also 
enormously necessary.) 


- You can add methods to objects on the fly.
This sort of thing actually works:


$ send X 'zzazz() { echo foo; }'
$ send X zzazz
foo 


From time to time, this last feature has proven 
itself a useful debugging tool. 
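The reason this works is that a function definition is itself an executable shell command. A minimal sketch of what happens inside the receiving object, with no FIFOs involved (the variable name `msg` simply mirrors the one in the event loop):

```shell
# A message that is a function definition teaches the shell a new method.
msg='zzazz() { echo foo; }'
eval "$msg"        # what the event loop does with every incoming message
eval "zzazz"       # a later message can now invoke the new method
```

Note the semicolon before the closing brace: when the whole definition sits on one line, the shell needs it to recognize the brace as the end of the function body.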


4. Real Code


Enough abstract chatter. Let’s see some code. 


4.1. send 


send, shown in figure 2, sends messages to 
objects. 


# send a message

case $1 in
-d) msgtype=D
    shift ;;
*)  msgtype=C ;;
esac

T_NAME=$1
shift

T_DIR=/tmp/ipc/$T_NAME
T_IN=$T_DIR/in
T_OUT=$T_DIR/out
USAGE=\
"usage: $(basename $0) obj msg"

abort() { # print and bail
    echo $* 1>&2
    exit 1
}

test $# -gt 0 || abort $USAGE

test -d $T_DIR ||
    abort $T_NAME: no such object

echo -e $msgtype "$*" > $T_IN
test $msgtype = "D" || cat $T_OUT

exit 0


Figure 2. send 


All object I/O channels are in subdirectories of 
/tmp/ipc. Each named object has a subdirectory that 
corresponds to its name, T_DIR, and all files associ- 
ated with that object are within that directory. By 
default, each object has at least an input channel, 
T_IN, through which messages arrive, and an output 
channel, T_OUT, to which returns are sent. 


The code shown above sets environment variables
to point at the right channels, and then, after a
brief sanity check, echoes its argument into the input
channel and reads the object’s response from the
output channel.


There are at least two restrictions of this 
design. First, the name space is global to the system; 
only one object on the entire system can be called 
“foo” at any given time. Second, there’s no provi- 
sion for tying returns to the messages that elicited 
them; if two objects send messages to a third at 
nearly the same time, there isn’t any way to guarantee 
that the return value one of the senders retrieves cor- 
responds to the message it sent. A more sophisti- 
cated implementation might nest object directories as 
subdirectories under the directories of the objects that 
created them, and use a more sophisticated messag- 
ing scheme to provide a virtual circuit between the 
messenger and the messagee. 


Even after accepting these limitations, at least 
two problems require immediate solution. 


The first of these is the deadlock that arises 
when object A sends a message to object B and 
object B, or some object further down the line, sends 
a message to object A before object B has replied to 
A’s original message. In these cases, object A cannot 
read the incoming message because it is blocked 
reading B’s output channel. The general case is a 
general problem, but in some cases object A doesn’t
really need B’s answer, and can go on to listen for 
incoming messages as soon as it dispatches a mes- 
sage to B. For just such cases, send accepts a flag, 
-d, that means “don’t wait for an answer.” Send 
prepends a ‘D’ to such “datagrams,” and replies are 
neither expected nor supplied. 
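On the receiving side, the one-letter prefix has to be peeled off before the message is executed. The two parameter expansions below are the ones used by event_loop in figure 3; the wrapper function `parse_packet` and the variable names `pkt_type` and `pkt_msg` are only illustrative.

```shell
# Split a packet of the form "<type> <message>" using parameter expansion.
parse_packet() {
    pkt=$1
    pkt_type=${pkt%% *}      # everything before the first space: C or D
    pkt_msg=${pkt#[A-Z] }    # strip the one-letter type and the space
}

parse_packet "D setFood Hunny"
echo "type=$pkt_type msg=$pkt_msg"   # prints: type=D msg=setFood Hunny
```

Because both expansions are built into the shell, no subprocess is needed to decode a packet.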


The second problem is trickier. In the absence 
of special arrangements, an open of a FIFO for writ- 
ing will only complete when that FIFO has an avail- 
able reader. Consider, then, what happens when 
object A sends a message to itself, using a command 
like 

echo $msg > /tmp/ipc/A/in. 


The echo will block, awaiting a reader, preventing A 
from ever executing the read that would move echo 
past the block. 


The current implementation side-steps this 
problem by providing each object with a built-in ver- 
sion of send. Whenever an object notices that it is 
sending a message to itself, it executes the message 
directly instead of trying to write the message to its 
own input channel. (See figure 7.) An alternative to
this would be putting echo in the background, but 
that would use up process slots, a resource that this 
system already strains. Another alternative might be 
writing substitutes for read and echo. 


4.2. create 


More complex than send is create, shown in 
figure 3. 


# create a new object

O_DIR=/tmp/ipc/$2
O_IN=$O_DIR/in
O_OUT=$O_DIR/out

O_BIN=${O_BIN:-$HOME/obj/bin}
O_CLASS=$1
O_NAME=$2
O_PATH=/bin:/usr/bin:$O_BIN
O_ROOTS=${O_ROOTS:-$HOME/obj/objs}
export O_BIN O_CLASS O_NAME
export O_PATH O_ROOTS


USAGE=\
"usage: $(basename $0) class obj"

abort() {
    echo $* 1>&2
    exit 1
}

test $# -eq 2 || abort $USAGE


# cleanliness is next to godliness
cleanup() {
    trap "" 0 1 2 3 15
    rm -rf $O_DIR
    exit 0
}


event_loop() {
    trap "cleanup" 0 1 2 3 15
    while read pkt < $O_IN
    do
        type=${pkt%% *}
        msg=${pkt#[A-Z] }

        if test $type = "D"
        then
            PATH=$O_PATH eval $msg
        else
            PATH=$O_PATH eval $msg \
                >$O_OUT
        fi
        # hack around BSDI timing bug
        sleep 1
    done
}

get_obj_chain() {
    # find object and superclasses
    IFS=:
    set $O_ROOTS
    IFS=' '
    for i
    do
        test -f $i/class || continue
        obj_root=$(
            find $i \
                -type d \
                -name $O_CLASS \
                -print
        )
        # quoted so that an empty result is seen as empty
        test -n "$obj_root" && break
    done
    test -z "$obj_root" && abort $USAGE

    # now set up paths
    d=$obj_root
    O_PATH=$O_PATH:$d
    O_DEFS=$d
    while test "$d" != "$i"
    do
        d=${d%/*}
        O_DEFS=$d" $O_DEFS"
        O_PATH=$O_PATH:$d
    done

    unset d i
}


# create message channels
make_channels() {
    test -d $O_DIR &&
        abort "Duplicate $O_NAME"
    mkdir -p $O_DIR ||
        abort "Can't make $O_DIR"
    mkfifo $O_IN ||
        abort "Can't make $O_IN"
    mkfifo $O_OUT ||
        abort "Can't make $O_OUT"
}


# build the object from definitions
mkobj() {
    for d in $O_DEFS
    do
        . $d/class 2>/dev/null
    done
}


get_obj_chain 
make_channels 
mkobj 
event_loop & 


Figure 3. create 





Following initialization and sanity checks, create 
makes four function calls to create an object. The 
first, get_obj_chain, sets the variable O_DEFS to a 
list of directory names, starting at the root object 
directory, that end in the directory that defines the 
class. For the class littleDog, from our earlier example,
O_DEFS would be set to

    objs
    objs/animal
    objs/animal/dog
    objs/animal/dog/littleDog


Next, make_channels creates the input and out- 
put channels used by send. 


Third, mkobj visits the directories in O_DEFS, 
reading class definitions. Because of the order in 
which get_obj_chain sets up O_DEFS, methods 
defined in subclasses supplement or override those 
defined in parent classes. 
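The override rule needs no machinery of its own: the shell keeps whichever definition of a function it read last. The sketch below rebuilds a fragment of the animal tree in a scratch directory and sources it root-first. The function name `class_chain` and the scratch layout are assumptions made for the illustration, and the scratch path is assumed to contain no spaces.

```shell
# Build a tiny class tree in a scratch directory.
root=$(mktemp -d)
mkdir -p "$root/animal/dog/littleDog"
echo 'bark() { echo "Unknown Dog Noise"; }' > "$root/animal/dog/class"
echo 'bark() { echo "woof woof"; }' > "$root/animal/dog/littleDog/class"

# List the directories from the root down to the class directory.
class_chain() {
    d=$1
    chain=$d
    while test "$d" != "$root"
    do
        d=${d%/*}              # shave off the last path component
        chain="$d $chain"
    done
    echo "$chain"
}

# Source each definition in turn; later (deeper) ones win.
# A directory without a class file is simply skipped.
for d in $(class_chain "$root/animal/dog/littleDog")
do
    if test -f "$d/class"
    then
        . "$d/class"
    fi
done

bark                           # prints: woof woof
rm -rf "$root"
```

Sourcing in the opposite order would leave the parent's bark in force, which is why the order of the list matters.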


Finally, event_loop loops infinitely, reading 
messages and writing responses on the pair of message
queues set up by make_channels. If the message
is the name of a function (a method),
that function is invoked. Otherwise, eval looks for a 
shell built-in or a UNIX command to execute. 


A disadvantage of the approach sketched above 
is that there isn’t an easy way to say “use my parent’s
definition of this method”; when an object definition
overrides a method defined by a parent, that parental 
method becomes completely unavailable. 


In an earlier version of this code, event_loop 
looked like this: 


event_loop() {
    trap "cleanup" 0 1 2 3 15
    while read msg < $O_ICHAN
    do
        eval $msg > $O_OCHAN
    done
    cleanup
}

In the version shown in figure 3, get_obj_chain stores 
the path to the directory that contains the object defi- 
nition in O_PATH, for later use in creative ways, 
including prefixing it to the path used by eval. Hav- 
ing that path available makes it possible to back up 
through the class hierarchy searching for a parental 
method. I’ve experimented with this, but the trick 
isn’t entirely satisfactory; a more sophisticated imple- 
mentation should find a more interesting way to use 
O_PATH or O_DEFS to gain access to parental-class 
methods. 


Like send, which has a single, global name- 
space for objects, create uses a global name space for 
object classes. The system will not support two different
definitions of class “dog”. On the other hand,
users can point at their own object-class definitions 
by setting O_ROOTS, even invocation by invocation. 
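The search through O_ROOTS is an ordinary colon-separated path walk, like the shell's own use of PATH. A sketch of that one step follows; the function name `first_root` and the scratch directories are hypothetical, and it covers only the test for a class file, not the find that comes after it in get_obj_chain.

```shell
# Print the first directory in a colon-separated list that holds a
# class file, in the spirit of the loop over O_ROOTS in get_obj_chain.
first_root() {
    old_ifs=$IFS
    IFS=:
    set -- $1              # split the list on colons
    IFS=$old_ifs
    for r
    do
        if test -f "$r/class"
        then
            echo "$r"
            return 0
        fi
    done
    return 1
}

# Demonstration with two scratch roots; only the second has a class file.
tmp=$(mktemp -d)
mkdir "$tmp/mine" "$tmp/shared"
: > "$tmp/shared/class"
found=$(first_root "$tmp/mine:$tmp/shared")
echo "${found##*/}"        # prints: shared
rm -rf "$tmp"
```

Putting a private directory first in the list is what lets a user shadow the shared class definitions for a single invocation.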


Another limitation of this system is that it is 
restricted to single inheritance; each class has one 
and only one parent class: that of its parent directory.
Although links might make it possible to provide
an interesting way to implement and explore
multiple inheritance, everyone knows that multiple
inheritance is a bad idea [Cargill91].


4.3. new and destroy 


Returning now to the example, we can show 
new (figure 4) and destroy (figure 5). 





# Create a set of objects

USAGE=\
"new class obj [obj ...]"
abort() {
    echo $* 1>&2
    exit 1
}

test $# -ge 2 ||
    abort "$USAGE"

class=$1
shift

for i
do
    create $class $i
done


Figure 4. new 





# destroy a set of objects

USAGE=\
"destroy obj [obj ...]"
abort() {
    echo $* 1>&2
    exit 1
}

test $# -ge 1 ||
    abort "$USAGE"

for i
do
    send $i destroy
done


Figure 5. destroy 





As advertised earlier, each of these is a simple 
loop. The earliest version of destroy was even 
simpler:


for i 
do 

send $i exit 
done 


This works because each object inherits the methods 
understood by the base class — the shell augmented 
by a few basic methods. The current version calls 
destroy instead, which lets each object define its own 
destructor. (The base class defines a simple default 
destroy, shown in the next section.) 


One alternative to explicit calls to destroy for 
each object would be to make new keep track of 
objects it has created and let destroy destroy every- 
thing. 

Because of encapsulation, the easiest imple- 
mentation would incorporate new as a method in a 
more sophisticated base class. An odd side-effect of 
this is that classes could redefine new. 


4.4. hop: a simple class 


Having constructed the infrastructure, let’s 
look at a simple class definition (figure 6). 





$ cat objs/hop
# class hop

hop() {
    if test "$*" = "on pop"
    then
        echo -n "Stop! "
        echo "You must not hop on pop."
    else
        echo "hippity hop"
    fi
    return 0
}
$ create hop X
$ send X hop
hippity hop
$ send X hop on pop
Stop! You must not hop on pop.
$ send X foo
$ send X exit
$ send X hop
X: no such object


Figure 6. Class hop 





The entire class definition is a single method: hop. 


Although this is an elementary example, it 
illustrates a few interesting points: 


(1) Defining methods is easy; it doesn’t require a 
lot of special syntax. 


(2) Class definitions are small. While code size 
is not the only measure of the quality of a 
programming language — else we would all 
program in APL — code size strongly affects 
maintenance and debugging efforts; bug fre- 
quency per line appears to be roughly con- 
stant across languages [Brooks75]. Less 
code, fewer bugs. 


As an illustration, Sessions contrasts the code 
to make a dog bark in C: 


void printDog(dog *thisDog,
              int dogType)
{
    printf("\n%s says\n",
        getName((dog *) thisDog));
    switch (dogType) {
    case DOG:
        dogBark(
            (dog *) thisDog);
        break;
    case LITTLEDOG:
        littleDogBark(
            (littledog *)
            thisDog);
        break;
    case BIGDOG:
        bigDogBark(
            (bigdog *) thisDog);
        break;
    }
}

with the code to do the same job in C++:

void printDog(dog *thisDog)
{
    printf("\n%s says\n",
        thisDog->getName());
    thisDog->bark();
}


Here’s the same code in the shell: 
echo $thisDog says 
send $thisDog bark 


(3) We’ve chosen to ignore nonsense requests.
The consequences of changing that decision
can be explored by toying with event_loop in
create.


(4) Methods can have arguments.


This example becomes even more interesting 
when we notice that we can invoke methods that 
aren’t defined by the class. As figure 7 shows, methods
defined by parent classes (in this case, the class
defined in directory objs) are inherited by their
subclasses.





$ create hop X
$ send X self
X
$ send X class
hop
$ cat objs/class

# fundamental methods 


abort() { # print and bail
    echo $* 1>&2
    exit 1
}

class() {
    echo $O_CLASS
}

debug() {
    echo $O_NAME: $*
}

defs() {
    for d in $O_DEFS
    do
        cat $d/class 2>/dev/null
    done
}

_destroy() {
    test $# -eq 0 && exit 0
    test $0 = "self" && exit 0

    for i
    do
        send -d $i destroy
    done
}

destroy() {
    _destroy $*
}


self() {
    echo $O_NAME
}


send() { # send a message
    case $1 in
    -d) msgtype=D;
        shift ;;
    *)  msgtype=C ;;
    esac

    T_NAME=$1
    shift
    T_DIR=/tmp/ipc/$T_NAME
    T_IN=$T_DIR/in
    T_OUT=$T_DIR/out

    USAGE="usage: send obj msg"
    test $# -gt 0 || abort $USAGE

    if test "$T_NAME" = "$O_NAME" ||
        test "$T_NAME" = "self"
    then
        PATH=$O_PATH eval $*
        return 0
    fi

    test -d $T_DIR ||
        abort $T_NAME: no such object

    echo -e $msgtype "$*" > $T_IN
    test $msgtype = "D" || cat $T_OUT
}

Figure 7. The base class 





Most of the methods defined in objs/class 
are simple utility methods. The motivation for the 
one long method, send, was given in section 4.1. 


4.5. animals


The code for our original animal example
shows inheritance in practice (figure 8). 





$ cat objs/animal
# base animals methods

name=$O_NAME
food="Unknown food."

setName() {
    name=$*
}
getName() {
    echo My name is: $name
}
setFood() {
    food=$*
}
getFood() {
    echo My favorite food is: $food
}
$ cat objs/animal/dog
# dog: all bark, no bite

bark() {
    echo Unknown Dog Noise
}
$ cat objs/animal/dog/littleDog
# a little dog

bark() {
    echo woof woof
    echo woof woof
}

$ cat objs/animal/dog/bigDog
# a BIG DOG

bark() {
    for i in 0 1 2 3 4
    do
        echo WOOF WOOF
    done
}


Figure 8. Animal Objects 


Here, the class animal defines methods for setting 
and getting an animal’s name and favorite food; the 
subclass dog adds a way to make the animal bark, 
and sub-sub-classes for little and big dogs replace 
that method with ones that generate size-appropriate 
noises. 


5. Applications 


Sir, a woman’s preaching is like a dog’s walking 
on his hinder legs. It is not done well; but you 
are surprised to find it done at all. 


Boswell’s Life of Johnson, vol 1, p 428, 31 July 
1763 
“Cute idea,” you say, “but is this good for
implementing real applications?” Probably not. 
Still, it seems worth sniffing around to see what sorts 
of things besides barking dogs might be interesting to 
implement with it. 


5.1. Starting small ... 


When I was soliciting suggestions for interest- 
ing applications to implement, Doug Pintar, of Aztec 
Engineering, laconically suggested emacs. While an 
emacs implementation might not fit within the page- 
limit length imposed by this conference, I can include 
a more contained, but logically equivalent, applica- 
tion. Appendix A shows the code for a Turing 
machine. The text below sketches the implementation
of each of the classes. The example, taken from
the nearest automata theory text to hand [Manna78], 
recognizes strings of the form 


a^n b^n


5.1.1. Turing machine


The machine itself is an
object that creates a tape object and five nodes. After 
initializing all the objects, loading the tape with an 
input string and the nodes with their transition tables, 
it starts up by telling the first node to go, and then 
awaits an announcement of success or failure from 
some node down the line. When the announcement 
arrives, the machine writes the result as output; 
destroys the nodes it has created; and exits. 


5.1.2. Tape 


The tape itself is trivial. Input data are stored 
as a string, and there are a handful of methods to 
move along the tape and to read or write at the cur- 
rent position. (We’ve tried to avoid mixed-case dis- 
ease, which seems endemic to object-oriented pro- 
grammers, but Read requires an initial, upper-case 
‘R’ because read is reserved by the shell. Write is 
just following suit.) 


The position is just a numeric index into the 
string, maintained using the POSIX shell’s built-in 
arithmetic facilities. 
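A cut-down version of the idea is shown below. It is a sketch rather than the tape class from the appendix: it keeps only the string, the index and a read operation, it uses the POSIX $(( )) arithmetic form, and the function name `cell` is an assumption (it prints to standard output instead of writing to the object's output channel).

```shell
# A tape is a string S plus a 1-based position n.
S=aabb
n=1

right() { n=$((n + 1)); }

# Print the character under the head, or _ past the end of the input.
cell() {
    if test "$n" -gt "${#S}"
    then
        echo _
    else
        printf '%s\n' "$S" | cut -c "$n"
    fi
}

cell            # first cell: a
right; right
cell            # third cell: b
right; right
cell            # past the end: _
```

Reading past the end of the string yields a blank marker, so the input never needs to be padded.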


5.1.3. Node


Nodes, too, are objects. When called
on, a node reads the current cell on the tape and looks 
up the entry for the character it reads in a dictionary 
of transitions that it creates and maintains (another 
object, described in the next section). Transitions are 
a triple containing a character to write into the current 
cell, a direction to move, and a new node to call. The 
current node writes the prescribed character into the 
cell, moves either left or right, and then calls on
another node (possibly itself) to handle the next cell. 
If the dictionary reveals that the node has reached a 
decision to accept or reject the input string, then 
instead of passing control to a new node, the current 
node sends the message accept or reject to the origi- 
nal Turing machine. 


The absence of returns and lack of a single, 
centralized transition table lend a palpably unstruc- 
tured aura to the process. 


Although this implementation uses a colon- 
separated array to store the three pieces of informa- 
tion associated with each possible input character and 
teases them apart with cut, performance could be 
improved somewhat by using the shell’s prefix- and 
suffix-shaving operators (${PARAMETER%%expression}
and friends) to do the parsing
without recourse to subprocesses. 
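For a transition such as A:right:s2, the subprocess-free parse might look like the following sketch. The variable names are illustrative and are not taken from the node class.

```shell
# Split "char:direction:node" without spawning cut.
action=A:right:s2
out_char=${action%%:*}      # everything up to the first colon
rest=${action#*:}           # everything after the first colon
direction=${rest%%:*}
next_node=${rest#*:}
echo "$out_char $direction $next_node"   # prints: A right s2
```

Each expansion is evaluated inside the shell, so three fields cost no forks at all, where the cut version pays for a pipeline per field.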


5.1.4. Dictionary


A dictionary stores its entries
as shell variables whose names are constructed on- 
the-fly from the words being defined. The command 


$ send $DICTIONARY define and dumb

turns into the assignment

def_and=dumb


In a better world, the shell might have arrays 
(indeed, the Korn shell does), but POSIX shells aren’t 
required to have them, and this work-around is good 
enough for our example. 
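The dictionary class itself is not reproduced here, so the following is only a guess at how two such methods could be written. The method names define and lookup are the ones used by the messages in this paper; the bodies are assumptions. The approach works only for words made of letters, digits and underscores, since the word becomes part of a variable name.

```shell
# Store each entry in a variable named def_<word>.
define() {
    eval "def_$1=\$2"
}

# Print the definition of <word>, or an empty line if it is undefined.
lookup() {
    eval "printf '%s\n' \"\${def_$1-}\""
}

define and dumb
lookup and          # prints: dumb
```

The backslashes delay expansion until eval runs, by which time the variable name has been assembled from the word.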


5.2. ... then getting smaller. 


Exploiting the ability of objects to learn new 
methods at run-time, we can also create a simpler but 
more tantalizing application. The code shown in fig- 
ure 9 shows a method that sends itself to another 
object: a virus. 


$ cat oneline
# put everything on one line 


tr -s '\012' '[ *]'

$ infect()
> {
> send $1 \"$(typeset -f infect|oneline)\"
> }
$ new null X
$ infect X
$ send X typeset -f infect
infect ()
{
    send $1 "$(typeset -f infect |
    oneline)"
}
$ new null Y
$ send X infect Y
$ send Y typeset -f infect
infect ()
{
    send $1 "$(typeset -f infect |
    oneline)"
}


Figure 9. A Simple Virus 


The script oneline is a work-around for two 
implementation problems. First, with the code shown 
here any methods learned at runtime must fit on a sin- 
gle line. Second, shell quoting conventions make it 
annoyingly difficult to fit the tr command inside the 
function itself. 


(We confess to having resorted to occasional 
non-standard shell extensions to avoid other lengthy 
circumlocutions, particularly echo -e, which interprets
many of the usual shell escape characters like ‘\n’,
and typeset. All these shell scripts run under bash, a
publicly available POSIX-conforming shell, and the
extensions are those provided by bash. Other POSIX 
shells provide analogous extensions.) 


A more sophisticated infect would go out and 
hunt for other objects to infect. A more sophisticated
create routine would permit multi-line messages. 


5.3. Summary


Neither of these applications is
particularly long (or useful), but each illustrates the 
capability and extensibility of this relatively simple 
system, and the power and flexibility of the shell as a 
programming language. 


As a parting note, we observe that the two
applications can be used in concert: despite its simplicity,
limitations, and implementation dependencies,
the virus shown above can be used to infect the
Turing machine described above to give it a cold in 
its nodes. 


6. send Paper exit 


This is hardly a complete system. On the other hand, 
it’s so simple that an average undergraduate who’s 
already familiar with UNIX at the shell level should 
be able to play with objects without first having to 
wrap his mind around a conventional OOPS like C++
or Smalltalk. A really good undergraduate should be 
able to enhance it in interesting ways without a 
course in compiler theory. I’ve suggested several 
such enhancements in this paper. 


What’s more, even though the exercise seems 
akin to making a sow’s ear out of a silk purse, it illus- 
trates that the shell has more power than many people 
give it credit for. That said, I’ll raise anew a question 
posed by Steve Johnson at the Winter ’94 USENIX
conference: “Will object-oriented programming
replace the shell?” Johnson intended the question to 
be rhetorical, but I harbor the suspicion that object- 
oriented shells, and other shells that break from the 
conventional-programming-language model, are 
fruitful areas of research [Budd89]. Mashey showed 
that creating a shell that was a real programming lan- 
guage was exactly the right idea, and that people 
would use a well-designed shell early and often
[Mashey76]. Given that, I’m surprised that nearly all 
widely available shells today still use C, ALGOL, or 
pocket-teller machines as their models. 


(A notable exception is Doug Gwyn’s “Adventure
Shell.” Though not a wild success as a programmer’s
shell, it has spawned, after a trip through a
maze of twisty passages, the development of MOOs.) 


Personally, I’ve long wanted to run “the
spread-shell” but I haven’t any idea what such a 
thing would do. If you write one, please send it to 
me. 
Appendix A: A Turing Machine


# turing machine
# recognizes a^n b^n

MACHINE=$O_NAME
TAPE=${MACHINE}_T
export MACHINE TAPE

destroy() {
    #debug destroy $*
    _destroy $TAPE s1 s2 s3 s4 s5
    exit
}

accept() {
    echo ACCEPT!
    destroy
}

reject() {
    echo REJECT!
    destroy
}


new tape STAPE 
new node sl s2 s3 s4 s5 


send STAPE load aabb 


# Hard-wire the nodes. 
# It’d be nicer to have this 
# load froma file. 


send sl transition a A:right:s2 
send sl transition X:right:accept 


send s2 transition B B:right:s2 
send s2 transition a a:right:s2 
send s2 transition b B:left:s3 


send s3 transition B B:left:s3 
send s3 transition a a:left:s4 


send s3 transition A A:right:s5 


send s4 transition a a:left:s4 
send s4 transition A A:right:sl 


send s5 transition B B:right:s5 
send s5 transition X:right:accept 


send -d sl goto 


Figure Al. Turing Machine 


# Turing machine tape 


unset S 
typeset -i n 
typeset -i j 


right() { 
    let n=n+1 
    return 0 
} 


left() { 
    if test $n -le 1 
    then 
        echo HALT 
        return 1 
    else 
        let n=n-1 
        return 0 
    fi 
} 


load() { 
    S=$1 
    let n=1 
    return 0 
} 


print() { 
    echo $S 
    let j=n 
    while let j=j-1 
    do 
        echo -n ' ' 
    done 
    # mark the head position 
    echo '^' 
    return 0 
} 


Write() { 
    let left_neighbor=n-1 
    let right_neighbor=n+1 
    left=$(echo $S | 
        cut -c -$left_neighbor) 
    right=$(echo $S | 
        cut -c $right_neighbor-) 
    S=${left}$1${right} 
    return 0 
} 


Read() { 
    if test $n -gt ${#S} 
    then 
        echo '_' > $O_OUT 
    else 
        echo $S | 
            cut -c $n >$O_OUT 
    fi 
    return 0 
} 


Figure A2. Turing Machine Tape 


# Turing machine node 


XITIONS=dict_$O_NAME 
new dict $XITIONS 


transition() { 
    send $XITIONS define $* 
    return 0 
} 


destroy() { 
    send -d $XITIONS destroy $* 
    _destroy $* 
} 


goto() { 
    SYMBOL=$(send $TAPE Read) 
    ACTION=$( 
        send $XITIONS lookup $SYMBOL 
    ) 

    debug $SYMBOL, $ACTION 

    if test -z "$ACTION" 
    then 
        send -d $MACHINE reject 
        return 0 
    fi 

    OUT_CHAR=$(echo $ACTION | 
        cut -f 1 -d:) 
    DIRECTION=$(echo $ACTION | 
        cut -f 2 -d:) 
    NEXT_STATE=$(echo $ACTION | 
        cut -f 3 -d:) 

    if test $NEXT_STATE = "accept" 
    then 
        send -d $MACHINE accept 
        return 0 
    fi 

    send $TAPE Write $OUT_CHAR 
    send $TAPE $DIRECTION 

    send -d $NEXT_STATE goto 
    return 0 
} 


Figure A3. Turing Machine Node 


# Small dictionary 


dictionary() { 
    set | sed -n 's/^def_//p' 
    return 0 
} 


define() { 
    eval def_$1="$2" 
    return 0 
} 


lookup() { 
    eval echo $"def_$1" 
    return 0 
} 


Figure A4. Simple Dictionary 


The Old Man and the C 


Evan Adams 
Sun Microsystems 


Abstract 


“You can't teach an old dog new tricks” goes the old 
proverb. This is a story about a pack of old dogs (C 
programmers) and their odyssey of trying to learn new 
tricks (C++ programming). 


C++ is a large, complex language which can easily 
be abused, but also includes many features to help 
programmers more quickly write higher quality code. 
The TeamWare group consciously decided which C++ 
features to use and, just as importantly, which features 
not to use. We also incrementally adopted those 
features we chose to use. This resulted in a successful 
C++ experience. 


1.0 Introduction 


This paper describes the experience of a group of C 
programmers adopting C++ for a new project. It is 
written from the viewpoint of C programmers and 
describes our expectations, surprises, pleasures, 
disappointments and trials and tribulations. It is 
intended for C programmers that may be considering a 
journey into the realm of C++. It is not intended to be 
a critique or evaluation of C++ as a programming 
language. 


The TeamWare project consisted of 8 very 
experienced C programmers, ranging from 4 to 13 
years of industrial C experience. In the spring of 1991 
we decided to implement the TeamWare project in 
C++. It was hoped that using C++ would lead to more 
code sharing, more cleanly structured code and 
improved internal interfaces. No one within the project 
had any significant experience with C++, although two 
members of the group had some object oriented 
experience. 


TeamWare is a set of command line and GUI tools 
built from several common libraries. The libraries are 
provided by the TeamWare group for use by the 
TeamWare applications; they are not provided for 
more general use. 


TeamWare is a code management product that 
encourages parallel development and is built on top of 
SCCS. A user makes a copy (bringover) of an SCCS 
hierarchy thus creating a personal hierarchy. In this 
hierarchy the user makes and tests changes. These 
changes are then integrated (putback) into the original 
hierarchy. If the integration hierarchy contains 
changes which are not in the user’s hierarchy, then 
TeamWare detects that there have been parallel 
changes and refuses the integration. Therefore, users 
must incorporate changes in the integration hierarchy 
into their own hierarchy before integrating. TeamWare 
also includes the filemerge utility, a graphical 
three-way differences program allowing users to 
merge parallel changes. TeamWare tracks both source 
file changes (SCCS deltas) and file renames. 


1.1 Which Way To Go? 


In the beginning, the group was faced with two paths. 
The first path, marketed by Nike, was labeled “Just Do 
It” and appealed to our impulsive nature. The second 
path, labeled “Crawl Before You Walk”, appealed to 
our logical selves. The Just Do It path called for each 
of us to decide for ourselves which features of the 
language to use and how to apply them. The Crawl 
Before You Walk path called for the group to use new 
features of C++ only as it became apparent that they 
added value over our more well understood C 
techniques. 


1.2 Getting Started 


We began by taking a C++ course taught by Hank 
Shiffman of SunPro Marketing and offered through 
Sun U., and by buying a handful of books [Ellis, 
Stroustrup 1990], [Eckel 1990], and [Dewhurst, Stark 
1989]. We found the Annotated Reference Manual to 
be somewhat daunting for the average programmer. As 
a language reference manual it is intended more for 
compiler writers and people interested in a very 
precise language definition. The other two books 
explain how to use the language. Dewhurst and Stark’s 
book is much more concise and was, therefore, the first 
reference. Eckel’s book was used when Dewhurst and 
Stark’s was inadequate. Many C++ books have been 
published since the spring of 1991 so there may well 
be better options available now. 


Initially, some of us felt that C++ would be pretty 
easy to pick up. After all, wasn't it just C with a bit 
more stuff? Hank's class convinced us otherwise. We 
left the class feeling that there was a lot to this 
language, some parts we liked, some parts we didn't 
and much that we didn't fully understand. For 
example, we left the class with the clear message - 
“Stay away from multiple inheritance”.¹ 


We chose the Crawl Before You Walk path and 
started with very modest goals; we would use classes 
instead of structures, constructors and destructors and 
member functions. We felt certain that our future 
would include inheritance and virtual member 
functions, but we did not feel ready for them yet. 


2.0 Features We Used 


2.1 Required Function Prototypes 


A function prototype is a function declaration 
containing the function’s return type and the types of 
all its arguments. Function prototypes allow the 
compiler to do strong type checking as the compiler 
ensures that a function is always called with 
parameters of the appropriate types. C++ requires a 
function prototype for every function that is called. 
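
As a minimal sketch of what the compiler checks, consider 
the following. The names scale() and scale_twice() are 
hypothetical and are not taken from the TeamWare sources. 

```cpp
// scale() and scale_twice() are hypothetical names used only for illustration.

// Prototype: the return type plus the type of every argument.
double scale(double value, int factor);

double scale_twice(double value)
{
    // This call is checked against the prototype above. A call such as
    // scale("3", 2) is rejected at compile time in C++.
    return scale(value, 2) * 2;
}

// The definition must agree with the prototype.
double scale(double value, int factor)
{
    return value * factor;
}
```

Because the prototype precedes the first call, a mismatch 
between a use and the definition shows up as a compile 
error rather than as a run-time surprise. 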


Initially, we ported some utility code from a 
previous project. It had been written in Kernighan & 
Ritchie (K&R) C. The first task was to change all the 
function declarations from the K&R C style to the C++ 
style, and to declare the function prototypes in the 
header files. This was tedious work, but quickly 
demonstrated the power of requiring accurate function 
prototypes. We kept compiling the files until the 
compiler no longer complained, knowing that, only 
then, did the uses match the definitions. In the long 
run, we found required function prototypes to be the 
single biggest advantage of C++ over K&R C or even 
ANSI C.² 


1. We did. 


Our early C++ days were very frustrating with 
respect to the C++ error messages. We found the error 
messages to be obscure and not terribly informative. 
Each of us had experiences of spending several hours 
trying to figure out what the compiler was telling us.³ 
Coming from C programmers this is no small 
statement. Over time this problem went away. We 
concluded that the C++ error messages are probably 
not much worse than the C compiler’s, but it took us a 
while to gain the same familiarity with them that we 
have with the C compiler’s. 


2.2 Classes 


Classes are the essence of C++’s object model. They 
are like C structures with the addition of constructors 
and destructors, public and private fields, member 
functions and the ability for one class to inherit from 
another. A class generally consists of the data needed 
to present a certain concept. The member functions are 
routines that operate on that data and form the 
interface to the class. 


2.3 Constructors and Destructors 


Constructors are routines that initialize newly 
allocated objects and destructors are routines that 
clean up before an object is de-allocated. An object is 
created by having the new operator allocate memory, 
and then the constructor initializes the memory. 
Likewise, an object is destroyed by having its 
destructor called to clean up the object (such as closing 
open file descriptors), and then having the delete 
operator de-allocate the memory. The new and 
delete operators replace traditional malloc() and 
free() usage. 
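
The sequence can be sketched as follows. LogFile is a 
hypothetical class, assumed here only to have something 
for the destructor to clean up; the live counter exists 
purely to make the lifetime visible. 

```cpp
#include <cstdio>

// LogFile is a hypothetical class that owns a stdio stream.
class LogFile {
public:
    // Constructor: runs after new has allocated the memory.
    LogFile() : fp(std::tmpfile()) { ++live; }

    // Destructor: runs before delete releases the memory.
    ~LogFile()
    {
        if (fp != nullptr)
            std::fclose(fp);
        --live;
    }

    static int live;    // number of LogFile objects currently alive

private:
    LogFile(const LogFile &);               // copying is not supported
    LogFile &operator=(const LogFile &);
    std::FILE *fp;
};

int LogFile::live = 0;

// Returns how many objects were alive between new and delete.
int count_while_alive()
{
    LogFile *log = new LogFile;   // stands in for malloc() plus an init routine
    int during = LogFile::live;
    delete log;                   // stands in for a cleanup routine plus free()
    return during;
}
```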


We became fans of constructors and destructors. 
Much of our C code had followed the same principles 
by providing one routine which would allocate an 
instance of the type and initialize its fields, and by 
providing a second routine which would de-allocate 
the appropriate member fields and the instance of the 
type itself. We were pleased to have language support 
for what we had been doing by hand. 


2. ANSI C has function prototypes but they are not 
required. ANSI C and C++ interpret the declaration 
void func() differently. In ANSI C, it is a function 
with no argument type checking, while in C++ it is a 
function with no arguments. The ANSI C equivalent is 
void func(void). 

3. A favorite was a message suggesting that “perhaps a 
; was missing” which was frequently generated when 
there was an extra semicolon. 


One aspect of constructors we found annoying is 
that they must be kept very, very simple because it is 
awkward to have a constructor return an error value, 
such as failure to open a file. We kept our constructors 
simple and then added a member function to do things 
which might fail. However, this separates the complete 
construction of an object into two, possibly separated, 
pieces. It is possible to end up with a partially 
constructed object. In one case, we passed into the 
constructor the address of an error variable so the 
constructor could return an error. 
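
Both workarounds might look roughly like this. Reader and 
its members are hypothetical names; the sketch only 
assumes that opening a file is the step that can fail. 

```cpp
#include <cstdio>

// Reader is a hypothetical class; opening a file is the step that can fail.
class Reader {
public:
    // Style 1: a trivial constructor. The risky step is a separate member
    // function, so the object can exist in a partially constructed state.
    Reader() : fp(nullptr) {}

    // Style 2: the caller passes the address of an error variable.
    Reader(const char *path, int *error) : fp(std::fopen(path, "r"))
    {
        *error = (fp == nullptr) ? -1 : 0;
    }

    ~Reader()
    {
        if (fp != nullptr)
            std::fclose(fp);
    }

    // Returns 0 on success and -1 on failure.
    int open(const char *path)
    {
        if (fp != nullptr)
            std::fclose(fp);
        fp = std::fopen(path, "r");
        return (fp == nullptr) ? -1 : 0;
    }

    bool is_open() const { return fp != nullptr; }

private:
    Reader(const Reader &);               // copying is not supported
    Reader &operator=(const Reader &);
    std::FILE *fp;
};
```

With the first style every caller has to remember to call 
open() and check its result; with the second, every caller 
has to supply and check the error variable. 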


2.4 Function Overloading 


Function overloading allows you to have more than 
one function with the same name as long as those 
functions take different types of arguments. In C, the 
use of a function named foo maps to one and only one 
function defining foo. With function overloading, this 
is no longer true. The reader must take into account the 
arguments passed to foo to correctly map foo to its 
implementation. 


As old C programmers we left the C++ class with 
an uneasy feeling about function overloading. We 
were fairly convinced that use of function overloading 
would be confusing and not prove to be worthwhile. 


However, constructors encouraged us to use 
function overloading. We often found it advantageous 
to provide more than one constructor for a class. 
Frequently, we would provide a very bare-bones 
constructor which initialized all the member fields to 
default values along with additional constructors 
which did more and more sophisticated initialization. 
We found this type of function overloading to be very 
natural and useful. Overloaded constructors probably 
represented about 90% of all our overloaded functions. 
The remaining overloaded functions tended to be ones 
which accepted different numbers of arguments. The 
ones accepting fewer arguments would supply default 
values for the missing arguments and call the one 
accepting the most arguments. Default arguments 
could have been used instead but, since they made us 
nervous, we chose to make the defaulting explicit. 
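
A small sketch of both patterns follows. Label and pad() 
are hypothetical names chosen for illustration, not code 
from our libraries. 

```cpp
#include <string>

// Label and pad() are hypothetical names used only for illustration.
class Label {
public:
    // Bare-bones constructor: every field gets a default value.
    Label() : text(""), width(0) {}

    // Additional constructors do progressively more initialization.
    Label(const std::string &t) : text(t), width(static_cast<int>(t.size())) {}
    Label(const std::string &t, int w) : text(t), width(w) {}

    std::string text;
    int width;
};

// The overload taking the most arguments does the real work.
std::string pad(const std::string &s, int width, char fill)
{
    std::string out = s;
    while (static_cast<int>(out.size()) < width)
        out += fill;
    return out;
}

// The shorter overload makes the default explicit and forwards the call.
std::string pad(const std::string &s, int width)
{
    return pad(s, width, ' ');
}
```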


2.5 Member Functions 


Member functions are a suite of functions associated 
with a class. They can be viewed as providing the 
interface to the class, that is, the set of operations 
which can be applied to instances of the class. A 
member function is called via a pointer to an object or 
an actual object. Each member function is implicitly 
passed a this pointer, which is a pointer to the object 
through which the call was made. Inside a member 
function the scoping rules change. The member fields 
and functions can be referenced directly; it is not 
necessary to use the this pointer. 


Member functions were a big hit. At this point, we 
were using objects and member functions, but no 
inheritance. Without inheritance, member functions 
are primarily syntactic sugar, but one we took a liking 
to. One of the larger weaknesses of C is that there is a 
single global name space for all functions. In C++, 
each class provides a separate name space for its 
member functions. This results in member functions 
being given less verbose and more descriptive names, 
resulting in more readable code. 


Initially, we found directly referencing member 
fields and functions rather disconcerting as we were 
referencing names which were not declared to be 
either local or global. Furthermore, we sometimes 
declared a parameter with the same name as a member 
field. The compiler did not complain about this either 
and the scoping rules result in the parameter hiding the 
member field. The latter can be a serious problem and 
was the source of several very subtle bugs. 
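
The hiding problem is easy to reproduce. In this sketch 
Counter is a hypothetical class; set_buggy() shows the 
mistake and set_fixed() shows one way around it. 

```cpp
// Counter is a hypothetical class used only to show the hiding problem.
class Counter {
public:
    Counter() : count(0) {}

    // Bug: the parameter hides the member field, so this assigns the
    // parameter to itself and the member is never updated.
    void set_buggy(int count) { count = count; }

    // Fix: name the member through the this pointer
    // (renaming the parameter works just as well).
    void set_fixed(int count) { this->count = count; }

    int get() const { return count; }

private:
    int count;
};
```

A naming convention that distinguishes fields from 
parameters would rule out the first form entirely. 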


Classes and member functions cause C++ to have 
more name spaces than C, so the naming conventions 
used in C programs often turn out to be inadequate for 
C++. It would be advantageous to use a naming 
convention which syntactically separates fields from 
parameters and variables. We never completely came 
to grips with this problem. 


2.6 Public, Private, and Protected Fields 


Member fields in a class can be either public, private 
or protected. Public fields can be referenced from any 
object (.) or object pointer (->); private fields can be 
referenced only from within their class's member 
functions and its friends; protected fields are the same 
as private unless inheritance is used. 


Some kinds of fields are really private to a class. 
These keep track of the internal state of an object and 
users of the class have no need to either read or write 
them. Other fields are of interest to the users of a class. 
Making these private requires that functions be 
provided to get and set the field. We called these 
accessor functions. 


We did not discuss group-wide conventions for 
using public, private and protected fields, and 
consequently two very different styles emerged. The 
people writing the libraries dabbled with private 
members, but did not find that they added much value 
and eventually took to declaring all new members 
public. The people writing the GUI applications made 
much more significant use of private member fields 
and their corresponding accessor functions. When 
debugging, they found it useful to be able to set a 
breakpoint in a set routine and catch all situations 
where a given field was being set. 


In hindsight, we believe we would have made 
much greater use of private fields if we had been 
providing a public API. Private fields give the 
implementors of an API much greater control over the 
interface by preventing arbitrary access. In our case, 
we were providing an API to ourselves and we did not 
find this level of interface control necessary. 


2.7 Inline Functions 


Inline functions have their bodies expanded at each 
call site. They eliminate the function call and return 
overhead in exchange for duplicate copies of their 
bodies. 


A frequent objection C programmers have to 
accessor functions is - “Why should I have to make a 
function call just to get the value of a field? This will 
be too expensive!”. Inlined functions provide a very 
nice solution to this problem. They allow the 
implementor of a class to tightly control its interface 
without needlessly sacrificing performance. When we 
used private data, the corresponding get routine was 
almost always inlined. Set routines were frequently 
inlined as well. 
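
For illustration, a private field with inlined accessors 
might be written as below. Window and its member names 
are hypothetical. 

```cpp
// Window is a hypothetical class with one private field.
class Window {
public:
    Window() : width(0) {}

    // Declared here, defined below with the inline keyword.
    int get_width() const;

    // A member function defined inside the class body is implicitly inline.
    // A set routine is also a convenient place for a breakpoint.
    void set_width(int w) { width = w; }

private:
    int width;
};

// Explicitly inline: the body can be expanded at each call site, so reading
// the field need not cost a function call.
inline int Window::get_width() const
{
    return width;
}
```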


A common question regarding inlined functions is 
- “What size function should be inlined?”. In 
answering this question you must consider the 
performance gain resulting from removing the 
function call overhead, versus a possible increase in 
code size. Someone suggested to our group a rule of 
thumb we think makes sense - do not inline any 
function which contains control structures.⁴ 


2.8 Public, Private, and Protected Member 
Functions 


Public, private and protected apply to a class's member 
functions as well as its fields. Just as for fields, they 
define a member function’s scope. We found we used 
private member functions less often than private fields. 


4. Strict interpretation of this rule would prevent simple 
if statements; however, these can be done with the ?: 
operator. 


As with private fields, we believe we would have 
made greater use of private member functions if we 
had been providing a public API as they separate the 
interface from the implementation. 


2.9 Inheritance and Virtual Functions 


Some of the early code we ported included a list 
package. This took us on our first journey into 
inheritance and virtual functions. Inheritance provides 
for building one class (a derived class) from another (a 
base class). The derived class becomes a superset of 
the base class and has all the base class's functionality 
along with any new functionality provided by the 
derived class. Virtual functions allow the derived class 
to modify the behavior of the base class. If the base 
class contains a virtual function and the derived class 
provides a function with the same name and 
arguments, then the derived class's function 
supersedes the base class's. 


Inheritance and virtual functions encourage 
implementing generic functionality in a base class, and 
then specializing that functionality in each of the 
derived classes. A list package is a good example. 
Much of the implementation of a list package is 
generic and applies to all lists regardless of the 
elements they contain. However, some of the 
implementation is specific to each type of list. Printing 
the elements of a list is an example; walking the list to 
visit each element is generic while the actual printing 
of an element is specific to the element type. 


Many C programmers are initially confused by the 
semantics of inheritance and virtual functions. They 
frequently have trouble determining whether a given 
call will invoke the base class’s function or the derived 
class’s function. Say you have a base class with a non- 
virtual member function then you can call this function 
via an object derived from the base class. If this non- 
virtual function in turn calls a virtual function, does it 
call the base class's function or the derived class's 
function? The answer is obvious to experienced C++ 
programmers⁵; however, it is often confusing to C 
programmers first learning C++. 


5. It will call the derived class’s function. 


We found it much easier to understand these 
semantics by thinking about the implementation. First, 
the C++ compiler will try to resolve as many function 
references as it can at compile time. Any reference to 
a non-virtual function is resolved at compile time. Any 
reference to a virtual function cannot be resolved at 
compile time. Secondly, a class's virtual functions are 
put into a table of virtual function pointers and this 
table is associated with every instance of that class. In 
the earlier example, when a non-virtual function is 
called from a derived object, a pointer to the derived 
object is passed into the function (the this pointer). 
The virtual function is then found in the this 
pointer’s virtual function table. Therefore, the derived 
object’s function gets called. 
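
The case described above can be written down directly. 
Base, Derived, describe() and name() are hypothetical 
names for this sketch. 

```cpp
#include <string>

// Base and Derived are hypothetical classes for illustration.
class Base {
public:
    virtual ~Base() {}

    // Non-virtual: which describe() to call is settled at compile time.
    // The call to name() inside it goes through the this pointer, so it is
    // looked up in the virtual function table of the actual object.
    std::string describe() const { return "I am " + name(); }

    virtual std::string name() const { return "Base"; }
};

class Derived : public Base {
public:
    std::string name() const override { return "Derived"; }
};
```

Calling describe() on a Derived object therefore reports 
the derived name, even though describe() itself belongs to 
the base class. 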


2.10 List Package 


We had two goals for our list package. First, to 
preserve the ability to have lists of things which were 
unaware they were in a list. Second, to have typed lists, 
that is, a list of ints, a list of char *s, a list of 
pointers to class foos, etc., rather than generic lists. 
The first couple of attempts at converting the list 
package were feeble. The interfaces were clumsy, 
there were many friend declarations, and usage 
was awkward. The third iteration settled down into 
what we felt was a pretty reasonable interface and by 
this time all the friend declarations had 
disappeared. We also concentrated on making it easy 
to create lists of new types. 


It has become apparent that object oriented 
languages do not deal well with implementing generic 
container classes. In C++ this is the motivation behind 
templates. However, we were using cfront 2.0 
which predated templates. 


An example of the generic container problem is 
shown by trying to copy a list. Copying a list is 
something the base list class should do as it 
understands the implementation of lists. A virtual 
function should be used to copy an element of a list as 
this allows each derived list to provide its own copy 
function. So far so good, but what type does the base 
class’s copy routine return? That is the dilemma. The 
only type it knows about is the base class, yet this is the 
wrong type to return since it is copying a typed list. We 
found it necessary to have the base class's copy routine 
return a pointer to the base class and then have each 
derived class also supply a copy routine. The derived 
class’s copy routine then calls the base class’s copy 
routine and casts the return value to a pointer to the 
derived class’s type. 


This same problem occurs for a few other 
functions, and applies to all derived lists. Therefore, 
we wrote a macro which, given the derived list's type, 
generates all the necessary functions. 
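
The shape of this workaround can be sketched as follows. 
This is not our list package: List, IntList and the 
TYPED_LIST_COPY macro are hypothetical stand-ins, and 
the fixed capacity is an assumption made to keep the 
example short. 

```cpp
// A heavily simplified, hypothetical list. Capacity is fixed to keep the
// sketch short; elements are stored as opaque pointers.
class List {
public:
    List() : count(0) {}
    virtual ~List() {}

    bool append(void *elem)
    {
        if (count == MAX)
            return false;
        elems[count++] = elem;
        return true;
    }
    int length() const { return count; }
    void *at(int i) const { return elems[i]; }

    // Generic copy: walks the list and lets the derived class copy each
    // element. The only type it can return is a pointer to the base class.
    List *copy() const
    {
        List *result = make_empty();
        for (int i = 0; i < count; i++)
            result->append(copy_elem(elems[i]));
        return result;
    }

protected:
    virtual List *make_empty() const = 0;
    virtual void *copy_elem(void *elem) const = 0;

private:
    enum { MAX = 16 };
    void *elems[MAX];
    int count;
};

// Given the derived list's type, generate its typed copy routine, which
// calls the base routine and casts the result.
#define TYPED_LIST_COPY(Type) \
    Type *copy() const { return static_cast<Type *>(List::copy()); }

class IntList : public List {
public:
    ~IntList()
    {
        for (int i = 0; i < length(); i++)
            delete static_cast<int *>(at(i));
    }
    TYPED_LIST_COPY(IntList)
    void add(int v)
    {
        int *p = new int(v);
        if (!append(p))
            delete p;
    }
    int get(int i) const { return *static_cast<int *>(at(i)); }

protected:
    List *make_empty() const override { return new IntList; }
    void *copy_elem(void *elem) const override
    {
        return new int(*static_cast<int *>(elem));
    }
};
```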


Likewise, there are several virtual functions which 
manipulate the elements of a list. The elements are 
stored as opaque types so each of these functions needs 
to cast the opaque type to the appropriate element type. 
We wrote a second macro to generate these functions. 


We believe that templates would have led to a 
cleaner solution to these problems but have not yet had 
a chance to try them. In the end, we were pleased with 
the list package (and its cousin the hash package). Our 
libraries contain nine different types of lists. Several 
members of the team commented that creating new list 
types was very easy and beneficial. 


2.11 More Inheritance 


We applied inheritance and virtual functions in several 
other places as well. They are very powerful concepts. 
It takes a bit of effort to become comfortable with them 
but their use can generate significant rewards. We 
found that using classes, inheritance, and virtual 
functions resulted in greater code sharing. Obviously, 
this level of code sharing was possible using C but it 
takes much more discipline. With C++, the language 
provided support for these concepts, making it much 
easier to achieve code sharing. 


For those familiar with TeamWare, its 
bringover and putback commands do very 
similar things yet they differ in a few areas. They are 
implemented with a base class called a Transaction 
and derived Bringover and Putback classes. The 
derived classes do things like argument parsing and 
implement the differences between the bringover 
and putback commands while the bulk of the work 
is done in the Transaction class. 


It was common, when a member of the group first 
encountered this implementation, for them to ask - 
“how do I tell if I'm in a bringover or putback 
command? Where is the global variable to test?”. The 
answer would be - “There is no global variable. An 
object knows which one it is so, if you are doing 
something unique to one of the commands, then it 
should be done in a virtual function.”. This was a new 
way of thinking. 
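
In outline, the arrangement looks something like this. 
The class names follow the description above, but the 
member functions and the strings they return are invented 
placeholders rather than TeamWare's actual logic. 

```cpp
#include <string>

// Class names follow the text; run(), direction() and execute() are
// invented placeholders.
class Transaction {
public:
    virtual ~Transaction() {}

    // The bulk of the work is shared and lives in the base class.
    std::string run() const
    {
        return "begin, " + direction() + ", end";
    }

protected:
    // Anything unique to one command goes in a virtual function instead of
    // a test on a global variable.
    virtual std::string direction() const { return "nothing"; }
};

class Bringover : public Transaction {
protected:
    std::string direction() const override { return "bringover"; }
};

class Putback : public Transaction {
protected:
    std::string direction() const override { return "putback"; }
};

// Callers hold a Transaction and never ask which command it is.
std::string execute(const Transaction &t)
{
    return t.run();
}
```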


Sometimes we would create a class not expecting 
it to become a base class only to discover later on that 
we needed to derive another class from it. We then 
faced the question - should all the base class's 
functions be made virtual, or should only those 
functions which our derived classes replace be made 
virtual? We did a little of both with no obvious results 
favoring either technique. We feel this issue comes 
back to whether or not you are providing a public API. 
If so, then you probably want to make your classes 
very flexible and allow derived classes to replace 
many of the functions. This implies that most, if not 
all, public member functions should be virtual. 
Otherwise, our experience indicates that it really 
doesn't matter. 


2.12 Pure Virtual Functions and Abstract 
Base Classes 


A pure virtual function is a virtual function in a base 
class for which no actual function is defined. A pure 
virtual function is declared by putting = 0 at the end of 
its declaration.⁶ If a class contains a pure virtual 
function, then it is an abstract base class. The 
significance is that the compiler will not allow you to 
have any instances of an abstract base class. 


Abstract base classes should be used when the base 
class is so generic that it is not useful just by itself. 
Only when some key functionality is supplied by a 
derived class does it become useful. Our list package 
was an example of an abstract base class. With typed 
lists, an instance of the base list class is not 
meaningful. Without pure virtual functions, the base 
class's implementation of these functions would 
probably consist of printing a nasty message and then 
exiting. 
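
A minimal sketch of the idea, using hypothetical Shape 
and Square classes rather than our list package: 

```cpp
#include <type_traits>

// Shape and Square are hypothetical classes for illustration.
class Shape {
public:
    virtual ~Shape() {}
    virtual double area() const = 0;   // pure virtual: no definition here
};

class Square : public Shape {
public:
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }

private:
    double side;
};

// "Shape s;" would not compile, because Shape is an abstract base class.
static_assert(std::is_abstract<Shape>::value, "Shape cannot be instantiated");
static_assert(!std::is_abstract<Square>::value, "Square supplies area()");
```

The compiler enforces at build time what the nasty 
message would otherwise have reported at run time. 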


2.13 Operator Overloading 


Operator Overloading is the ability to redefine the 
basic C++ operators. TeamWare’s applications did not 
lend themselves to needing operator overloading. We 
only overloaded the new and delete operators for 
some classes so that we could impose our own 
memory management. This was very useful and 
allowed us to elegantly gain significant performance 
improvements. 
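
One common form of class-specific memory management is 
a free list. The sketch below is an assumption about what 
such a scheme can look like, not a description of our 
allocator; Sample and its members are hypothetical. 

```cpp
#include <cstddef>
#include <new>

// Hypothetical class with its own allocator: freed objects are kept on a
// free list and handed out again instead of going back to the heap.
class Sample {
public:
    Sample() : a(0), b(0) {}
    long a, b;

    static void *operator new(std::size_t size)
    {
        if (size == sizeof(Sample) && free_list != nullptr) {
            FreeBlock *block = free_list;
            free_list = block->next;
            ++recycled;
            return block;
        }
        return ::operator new(size);
    }

    static void operator delete(void *p, std::size_t size)
    {
        if (p == nullptr)
            return;
        if (size != sizeof(Sample)) {      // a derived class: not pooled
            ::operator delete(p);
            return;
        }
        FreeBlock *block = ::new (p) FreeBlock;   // reuse the storage as a link
        block->next = free_list;
        free_list = block;
    }

    static int recycled;    // how many allocations came from the free list

private:
    struct FreeBlock { FreeBlock *next; };
    static FreeBlock *free_list;
};

int Sample::recycled = 0;
Sample::FreeBlock *Sample::free_list = nullptr;

static_assert(sizeof(Sample) >= sizeof(void *), "a freed Sample must hold a link");
```

Code that creates and destroys Sample objects is 
unchanged; only the class knows that its storage is 
being recycled. 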


General operator overloading seems like a feature 
that is likely to be abused. During the C++ class, Hank 
warned us to never overload an operator to do 
something entirely different than the operator's normal 
semantics. This is very reasonable advice. If you have 
an object and doing things like adding two of them 
together makes sense, then operator overloading may 
be the way to go. However, we would recommend that 
it be used with caution. 


6. The syntax is abysmal. An = followed by 0 is a pure 
virtual function. An = followed by anything else is an 
error. 


2.14 Calling C Routines From C++ 


It is common to need to call C routines from C++ as 
many libraries, such as libc, contain routines written 
in C. At first glance, this wouldn’t appear to pose any 
problems. However, since C++ allows function 
overloading, it is forced to perform name-mangling on 
function names. That is, if you have three different 
functions named foo, then C++ has to invent a unique 
name for each of them. External routines written in C 
will not have mangled names, so C++ allows you to 
indicate that a given function is a C routine and that its 
name should not be mangled. This is done by 
preceding a declaration with extern "C"⁷. 


The extern "C" declarations provide a nice 
mechanism and, when needed, are absolutely crucial. 


2.15 Calling C++ Routines from C 


Calling C++ routines from C is an entirely different 
matter. There are two ways to call external C++ 
routines from C. You can either deduce the routine’s 
mangled name and call it, or you can define the global 
C++ routine to be extern "C" and defeat the name 
mangling. 

It is more challenging to call C++ member 
functions from C because you don’t have objects in C. 
Say you have a C++ class and a corresponding 
structure in C. To call a member function, you would 
need to write a wrapper routine in C++ which takes the 
structure as a parameter, converts it to an object and 
then calls the appropriate member function. The 
wrapper routine must also be prepared to convert any 
return values. 
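
Both cases can be sketched as follows. All names are 
hypothetical. Note one substitution: instead of converting 
a C structure into an object, as described above, these 
wrappers hand C an opaque pointer to the object, which is 
a simpler variant of the same idea. 

```cpp
// All names below are hypothetical.

// A global C++ routine exported with C linkage. Its symbol is not mangled,
// so a C file can declare "int tw_add(int, int);" and call it directly.
extern "C" int tw_add(int a, int b)
{
    return a + b;
}

class Tally {
public:
    Tally() : n(0) {}
    int bump() { return ++n; }

private:
    int n;
};

// Wrapper routines, written in C++, that let C reach a member function.
// C code sees only a void pointer and never the class itself.
extern "C" void *tally_create()
{
    return new Tally;
}

extern "C" int tally_bump(void *t)
{
    return static_cast<Tally *>(t)->bump();
}

extern "C" void tally_destroy(void *t)
{
    delete static_cast<Tally *>(t);
}
```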


We called global C++ routines from C and did so 
by deducing the mangled names⁸. This was the wrong 
way to do it. We were unaware of the extern "C" 
technique until Hank pointed it out while reviewing 
this paper. We never tried to call C++ member 
functions directly from C. 


Having mangled names in our C sources leaves us 
at the mercy of the C++ compiler. There is no 
guarantee that all C++ compilers will mangle names 
in the same way or that a given C++ compiler will 
always use the same technique. This creates both 
portability and maintenance problems. The extern 
"C" approach is the civilized technique. 


2.16 Comments 


C++ introduces a new syntax for comments: // starts a 
comment that runs until the next newline⁹. 


7. This is strange syntax. "C" is not a keyword as much 
as a key-string-literal. 

8. We did this with nm. 

9. What does this have to do with adding objects to 
C? 


This is another topic that we never discussed 
amongst the group. Consequently, some people chose 
to use // and others /* */. Furthermore, we used 
gxv++ and it generated code with // comments. 
Ultimately we ended up with some files using //, 
some using /* */ and, worse yet, some using both. 


There is clearly no right or wrong answer here. 
However, it would have been preferable if we had all 
used the same style. 


2.17 set_new_handler() 


The new operator uses malloc() to allocate a new 
object. If it is unable to allocate memory, then it returns 
a NULL pointer. The C++ library routine 
set_new_handler() allows you to register a 
function to be called when the new operator fails. 


set_new_handler() provides a nice way to 
intercept malloc() failures within the built-in new 
operator. The alternative is to check the return value of 
every call to the new operator. We used 
set_new_handler() to register one routine for 
command line programs and a different routine for 
GUI programs. The command line routine printed an 
error message and exited and the GUI routine 
displayed a pre-allocated notice and then exited after 
the notice was dismissed. 
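
A command line handler along these lines might be 
registered as shown below. The function names are 
hypothetical. The sketch uses the standard library's 
std::set_new_handler from the <new> header; in standard 
C++ a failing new throws std::bad_alloc if no handler 
intervenes, rather than returning NULL as described above 
for our compiler. 

```cpp
#include <cstdio>
#include <cstdlib>
#include <new>

// Hypothetical handler for a command line program: report and exit.
void out_of_memory()
{
    std::fputs("out of memory\n", stderr);
    std::exit(1);
}

// Registers the handler and returns whichever handler was installed before.
std::new_handler install_handler()
{
    return std::set_new_handler(out_of_memory);
}
```

Once installed, the handler is called whenever the new 
operator cannot obtain memory, so individual call sites 
need no checks of their own. 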


3.0 Features We Chose Not to Use 


Until now, this paper has described the features of C++ 
we used and our experiences with those features. This 
section describes the features we chose not to use and 
why. 


3.1 Multiple Inheritance and Virtual Base 
Classes 


Multiple inheritance allows a derived class to inherit 
from more than one base class. Each base class can 
also be the product of multiple inheritance, creating an 
inheritance DAG. A class derived via multiple 
inheritance exports an interface which is the union of 
the interfaces exported by all its base classes. The 
inheritance DAG can include the same base class more 
than once. In this situation, virtual base classes control 
whether or not the derived class has just one or many 
instances of the base class. 


When considering inheritance, the is-a versus has-a
relationship is crucial. If a derived class is a kind of
another class, then inheritance is proper; however, if
the derived class has an instance of another class, then
aggregation is proper. For example, an editor window
would derive from a generic window class because it
is a window. An editor window would also have a font,
but it would not inherit from the font class. Rather, it
would contain an instance of the font class or a pointer
to an instance. For multiple inheritance to be proper, a
derived class needs to be one of several other classes.
We never encountered this situation.
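
The editor window example can be sketched as follows. The class names
Window, Font and EditorWindow are hypothetical and are not taken from
our code.

```cpp
#include <string>

// Hypothetical classes, named after the example in the text.
class Font {
public:
    explicit Font(const std::string& name) : name_(name) {}
    const std::string& name() const { return name_; }
private:
    std::string name_;
};

class Window {
public:
    virtual ~Window() {}
    virtual std::string title() const { return "window"; }
};

// is-a: an editor window is a window, so it derives from Window.
// has-a: an editor window has a font, so it holds a Font member
// instead of inheriting from Font.
class EditorWindow : public Window {
public:
    explicit EditorWindow(const Font& font) : font_(font) {}
    std::string title() const { return "editor"; }
    const Font& font() const { return font_; }
private:
    Font font_;
};
```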


The C++ books we had were of little help either.
Despite Waldo’s [1991] protests to the contrary, we 
agreed with Cargill [1991] that the multiple 
inheritance examples in the books were contrived and 
could have been more cleanly expressed with 
aggregation. 

We found single inheritance easy to understand 
and use and extremely powerful. However, we found 
multiple inheritance to be very complicated and 
confusing. With multiple inheritance, the reader of a 
class definition must assimilate the entire inheritance 
DAG complete with virtual and non-virtual base 
classes to understand the class. Multiple inheritance
creates an awkward ambiguity when two base classes
have a member with the same name, and the name
resolution rules in a complicated inheritance DAG are
extremely difficult to follow.


More than any other feature in C++, multiple 
inheritance appears to be a large wart on the language. 


3.2 Reference Parameters 


Reference parameters allow you to pass parameters by 
reference rather than by value. There are some 
situations in which reference parameters are highly 
desirable, namely operator overloading, where they 
clean up an otherwise very ugly syntactical situation. 
There are other cases where they can cause confusion 
such as for base types. Reference parameters are an 
attempt to clean up some of the notation associated 
with pointers in C. In C, if you have a structure and 
want to pass a pointer to it as a parameter you must 
take the structure's address with the & operator. 
References let the compiler do this for you. 


Linton [1993] contends that reference parameters 
aid in storage management. Typically, a local object is 
passed as a reference parameter. Therefore, the 
receiving function is promising not to store the 
parameter’s address in a global data structure. 


More fundamentally, a C++ programmer must 
decide if their basic programming model is pointer 
based or object based, that is, do they have pointers to 
objects or just objects. This is complicated by the fact 
that objects can be declared globally and locally, yet 
the new operator returns a pointer to an object, not the 
object itself. Reference parameters assume that you
are object based, not pointer based; however, this
makes it more awkward to use the new operator. Most
real applications use the new operator because most 
applications need to dynamically allocate objects. 


Since we were all experienced C programmers, we 
were comfortable with the notation associated with 
pointers. We had also developed a programming 
model in our heads which reference parameters 
change. For example, in C, parameters cannot be 
changed in a way which affects the calling function. 
Early in our development someone used a reference 
for an int parameter. Another person was reading the 
code and thought he had found a bug because the int 
was being passed to a routine and the next line clearly 
assumed the int's value had changed. We all knew an 
int passed by value could not be changed in the caller 
so this must be a bug. However, since the called
function declared a reference parameter, the compiler was
actually passing in the address of the int, so the
parameter was being modified by the call. We avoided
references because of that incident, because our code 
made heavy use of the new operator, and because we 
did not use operator overloading. There are some 
tricks which old dogs can't learn. 
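
The incident can be reproduced in a few lines. The function names
below are hypothetical; the point is only that the three calls look
nearly identical to the reader of the calling code.

```cpp
// Hypothetical functions illustrating the incident described above.

// Declared with a reference parameter: the callee can change the
// caller's int, yet the call site looks like pass by value.
void bump_by_reference(int& n) { n += 1; }

// The C-style alternative we preferred: the & at the call site makes
// the possible modification visible to the reader.
void bump_by_pointer(int* n) { *n += 1; }

// Plain pass by value: the caller's int is never affected.
void bump_by_value(int n) { n += 1; }
```

A call such as bump_by_reference(count) changes count, while
bump_by_value(count) does not, and nothing at the call site tells the
two apart.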


3.3 Friends 


Friends allow you to specify that a function or an
entire class can access a class's private member fields
and functions.


We observed that friends are generally used when 
interfaces are not cleanly defined. They can be thought 
of as the casts of classes, that is, a way to circumvent 
the language’s built-in safeguards. We were not 
successful in completely avoiding friends, but we 
strongly suspect that the two places we used them 
signify flaws in our interfaces. We discourage the use 
of friends. 
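
For readers unfamiliar with the syntax, a small sketch follows. The
classes Account and Auditor and the function peek_balance are
hypothetical.

```cpp
// Hypothetical classes: Account grants access to one function and to
// one entire class.
class Account {
public:
    explicit Account(long cents) : cents_(cents) {}
private:
    long cents_;
    friend long peek_balance(const Account& a);  // a friend function
    friend class Auditor;                        // a friend class
};

// Not a member of Account, yet allowed to read its private field.
long peek_balance(const Account& a) { return a.cents_; }

class Auditor {
public:
    bool is_overdrawn(const Account& a) const { return a.cents_ < 0; }
};
```

Both friends bypass the public interface of Account, which is why we
regard them as a sign that the interface is missing something.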


3.4 I/O Streams 


C++ provides an I/O Streams package as a
replacement for C’s stdio package. Stdout is 
represented by the cout object and output is done 
with the overloaded << operator; stdin is 
represented by the cin object and input is done with 
the overloaded >> operator. 


We took an initial dislike to this package. We were
perfectly comfortable with printf, we were told that
I/O Streams have somewhat worse performance than
the stdio package, and we did not care for the
syntax. Actually, this package violates one of the
axioms for operator overloading that Hank told us
about: “Never overload an operator for something
other than its normal purpose”. This package
overloads the shift operators for input and output.
Deciding to not use C++'s I/O streams was probably
one of the quicker decisions we made.
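
To make the syntactic difference concrete, here is the same message
built in both styles. This is a sketch in current standard C++; the
function names are hypothetical, and strings are returned rather than
printed only so that the two results can be compared.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// stdio style: a format string plus arguments.
std::string report_stdio(const char* name, int count)
{
    char buf[128];
    std::snprintf(buf, sizeof buf, "%s: %d files\n", name, count);
    return buf;
}

// I/O Streams style: the shift operator << is overloaded for output.
std::string report_streams(const char* name, int count)
{
    std::ostringstream out;
    out << name << ": " << count << " files\n";
    return out.str();
}
```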


3.5 Operator Overloading 


As mentioned above, we overloaded the new and 
delete operators for some classes. Otherwise, we 
did not use this feature. We felt that operator
overloading could be a very seductive feature, as it
lets you, in effect, create your own language. This
temptation should be avoided. There are probably
some very good situations for operator overloading,
but care should be taken to ensure it is not used
gratuitously.


3.6 Default Arguments 


Default arguments allow a function call to be missing 
its trailing arguments and they will receive default 
values. Default arguments are essentially a short-hand 
for writing several overloaded functions. 


One person in our group used a few default 
arguments and liked them. The rest of us never felt 
they were necessary and in similar situations wrote the 
overloaded functions. 
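
The equivalence between the two approaches can be sketched as
follows; the function names and the default width of 4 are
hypothetical.

```cpp
#include <string>

// With a default argument: one function, the trailing argument is
// optional at the call site.
std::string indent_default(const std::string& text, int width = 4)
{
    return std::string(width, ' ') + text;
}

// The equivalent pair of overloaded functions.
std::string indent_overloaded(const std::string& text, int width)
{
    return std::string(width, ' ') + text;
}

std::string indent_overloaded(const std::string& text)
{
    return indent_overloaded(text, 4);
}
```

Callers cannot tell the two versions apart: both accept one argument
or two.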


3.7 Local Declarations Anywhere 


C requires that local variables be declared at the
beginning of statement blocks. C++ lets you declare
variables anywhere.¹⁰ For example, if an int is used
just in a for loop, then the for loop can be written:


for (int i = 0; ...


We found very little practical value to this feature. 
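
For comparison, the same loop in both styles; the function names are
hypothetical.

```cpp
// C style: every local is declared at the top of the block.
int sum_c_style(const int* values, int count)
{
    int i;
    int total = 0;
    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}

// C++ style: the loop counter is declared where it is first used.
int sum_cpp_style(const int* values, int count)
{
    int total = 0;
    for (int i = 0; i < count; i++)
        total += values[i];
    return total;
}
```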


4.0 Features We Didn't Think About 


C++ is a large, complex language. The following is a
set of features that we really didn't think about and
therefore did not use.¹¹


4.1 Global Objects 


Global objects are instances of classes with a global
scope. They must be initialized via a constructor
before main() is called. C++ sets up the executable
this way; however, the order in which constructors in
different source files are invoked is not defined. So, if
one global object's initialization depends upon another,
there are potential ordering problems.


10. What does this have to do with adding objects to C?
11. Since we didn’t think about them, this list is probably
incomplete.


We did not have any global objects, in part because 
we used a pointer based programming model, so we 
never encountered the ordering problem. However, 
another group in SunPro did encounter this problem 
and warned us about it. 
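
A small sketch shows global constructors running before main(). The
class Tracked and the helper construction_log are hypothetical. Within
a single source file, as here, the order is the order of definition;
the hazard arises when the definitions live in different source files.

```cpp
#include <string>
#include <vector>

// Hypothetical helper that records the order in which constructors
// run. The log is a function-local static so that it exists before
// any global object tries to use it.
std::vector<std::string>& construction_log()
{
    static std::vector<std::string> log;  // created on first call
    return log;
}

class Tracked {
public:
    explicit Tracked(const char* name) { construction_log().push_back(name); }
};

// Global objects: both constructors run before main() is entered.
// If these two definitions were in different source files, the order
// between them would not be specified.
Tracked first_global("first");
Tracked second_global("second");
```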


4.2 Copy Constructors 


Copy constructors are used when an object is copied. 
For example, if a class has a char * field where each 
instance has its own copy of the string, then, when this 
object is copied, the char * field should be duplicated 
rather than just copying the pointer. Copy constructors 
provide this mechanism. 


This is another situation which the pointer based 
programming model seems to avoid. 
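
The char * case can be sketched as follows. The class Label is
hypothetical; the assignment operator is included because a class that
needs a copy constructor generally needs matching assignment as well.

```cpp
#include <cstring>

// Hypothetical class in which each instance owns its own copy of a
// string.
class Label {
public:
    explicit Label(const char* text) : text_(duplicate(text)) {}

    // Copy constructor: duplicate the characters, not just the pointer.
    Label(const Label& other) : text_(duplicate(other.text_)) {}

    // Assignment needs the same care so two Labels never share storage.
    Label& operator=(const Label& other)
    {
        if (this != &other) {
            char* copy = duplicate(other.text_);
            delete[] text_;
            text_ = copy;
        }
        return *this;
    }

    ~Label() { delete[] text_; }

    const char* text() const { return text_; }

private:
    static char* duplicate(const char* s)
    {
        char* copy = new char[std::strlen(s) + 1];
        std::strcpy(copy, s);
        return copy;
    }

    char* text_;
};
```

Without the copy constructor, a copied Label would share the original's
buffer and both destructors would free it.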


4.3 Static Member Fields 


Static member fields are fields which are shared across
all instances of a class. They can be considered
global data which is scoped by a class, that is, they can
be accessed only through the class's name or an instance
of the class.


4.4 Static Member Functions 


Static member functions are member functions which 
are not passed a this pointer. Like static member 
fields, they can be considered global functions which 
are scoped by a class. 
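
Both kinds of static member appear in the following sketch; the class
Session is hypothetical.

```cpp
// Hypothetical class that counts its live instances.
class Session {
public:
    Session() { ++live_count_; }
    Session(const Session&) { ++live_count_; }
    ~Session() { --live_count_; }

    // Static member function: it has no this pointer and can be
    // called as Session::live_count() without any instance.
    static int live_count() { return live_count_; }

private:
    // Static member field: one copy shared by every Session.
    static int live_count_;
};

// The single definition of the shared field.
int Session::live_count_ = 0;
```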


4.5 Const 


Const is new to ANSI C and to C++. Unfortunately it 
is somewhat different in the two languages. We were 
coming from a K&R C world without consts. 


We never gave consts much thought. Looking back 
we do not recall any situations in which consts would 
have saved us. Still, they seem like a reasonable 
feature, especially in interfaces to class libraries. If an 
interface rigorously uses consts in its declarations, 
then users can know which parameters may be 
modified by which routines. This looks like another 
feature which is more valuable when producing a 
public API. 
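
As an illustration of const in an interface, consider two hypothetical
library routines. The declarations alone tell a caller which one may
modify the buffer.

```cpp
#include <cstddef>

// The const in this signature promises that the text is only read.
std::size_t count_spaces(const char* text)
{
    std::size_t n = 0;
    for (; *text != '\0'; ++text)
        if (*text == ' ')
            ++n;
    return n;
}

// No const here: the routine may, and does, modify the text in place.
void blank_to_underscore(char* text)
{
    for (; *text != '\0'; ++text)
        if (*text == ' ')
            *text = '_';
}
```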
5.0 Conclusions


The TeamWare group felt that using C++ was an 
advantage and that using C++ helped reduce 
development time and increase the quality of our code. 
We found the single largest advantage to be required 
function prototypes. We converted two existing C
programs¹² to C++ and in both cases function
prototypes found one or two latent errors. Required
function prototypes eliminate an entire class of errors
which occur in C and give programmers much more
confidence. However, function prototypes alone are
not sufficient reason to use C++, as an ANSI C
compiler which enforced the use of prototypes would
be a more sensible choice.


We were very pleased with C++’s object model. 
This includes classes, constructors, destructors, single 
inheritance, member functions, virtual functions and 
abstract base classes. We found these features of the 
language easy to adopt and easy to understand. C++’s 
object model encourages the good programming 
practices of modularization, well defined interfaces 
and code sharing. The object model provides valuable 
functionality which sets C++ apart from ANSI C. 


One significant disappointment with C++ is that it 
does not separate a class’s interface from its 
implementation. Abstract base classes are nice as they 
provide an interface specification. However, if the 
class also has private member functions, they must be 
defined along with the public ones. If a private 
member function is added to a class, its interface has 
not changed but, in the world of make, the header file 
has changed and all files including that header will be 
recompiled. 
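
The point can be illustrated with a hypothetical pair of classes. The
abstract base class Merger is pure interface; SimpleMerger carries its
private helper inside the class definition, so adding another private
helper changes the header that declares it.

```cpp
#include <string>

// What clients need: an abstract base class that is pure interface.
class Merger {
public:
    virtual ~Merger() {}
    virtual std::string merge(const std::string& a, const std::string& b) = 0;
};

// The implementation class. Its private members are still part of the
// class definition, so adding one changes this definition and forces
// every file that includes it to be recompiled.
class SimpleMerger : public Merger {
public:
    std::string merge(const std::string& a, const std::string& b)
    {
        return join(a, b);
    }
private:
    std::string join(const std::string& a, const std::string& b) const
    {
        return a + "\n" + b;
    }
};
```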


Unfortunately, the object model includes multiple 
inheritance and, consequently, virtual base classes. 
Most examples of multiple inheritance found in C++ 
books are contrived and should be replaced with 
aggregation. They do not provide good models to 
follow. Multiple inheritance adds complexity to a 
program. It should only be used if the is-a relationship 
is satisfied for each inheritance and if the result is 
simpler than aggregation. We believe that groups 
which follow this advice will rarely, if ever, use 
multiple inheritance. 


Second only to multiple inheritance is the decision 
regarding a pointer based programming model versus 
an object based programming model, or whether or not 
to use reference parameters. C programmers tend to 
gravitate towards the pointer based programming 
model as it is familiar to them. While we favored the
pointer based model, it is more important that a
decision be made and that the decision be applied
consistently throughout the project. In the case of
public APIs it might be appropriate to supply both
pointer and reference interfaces as this allows users of
a library to choose their own style.


12. filemerge and make.


We chose to use a fairly small subset of C++ as we
found it had an overabundance of features.
Unfortunately, the ANSI C++ committee is not done 
adding to the language. Exceptions and templates were 
added after we started our project. The language is still 
growing and looks like it will include run-time type 
information (RTTI), name spaces and additional cast 
operators. We fear that as C++ grows it becomes less 
usable. This may result in C trying to adopt a useful 
subset of C++ features. The ANSI C committee has 
voted to reconvene itself and has received a proposal 
to add classes with single inheritance [Jervis 1993]. 
While we welcome these features into C, we fear the 
war between C and C++ that may result. 


We are convinced that incrementally adopting C++ 
features and making conscious decisions about which 
features to use and how to use them was the right thing 
to do. If anything, we should have discussed more 
thoroughly some of the features of the language, such 
as private and public members and comments. 


One final anecdote concerns the earlier story
where someone used an int reference parameter. In
discussing this paper, the programmer's comment was:
“Well, the feature was in the language so I figured I
should use it.” It is our belief that this is not a
sufficient criterion for using a feature of C++. A feature
should be used only when it can be demonstrated to be 
of benefit. A mountain is climbed “because it is there”. 
The same should not hold true for C++ features. Their 
mere existence is not justification for use. 
6.0 Summary


Below is a table of C++ features along with our
assessments:


Feature                          We Used   Comments
-------------------------------  --------  ---------------------------------
Function Prototypes              Yes       Most valuable feature
Objects                          Yes       Second best feature
Classes                          Yes       Well done
Constructors/Destructors         Yes       Good programming practice
Member Functions                 Yes       Good name space scoping
Single Inheritance               Yes       Promotes code sharing
Virtual Functions                Yes       Powerful; promotes code sharing
Pure Virtual Functions           Yes       Ugly syntax; valuable
Abstract Base Classes            Yes       Ugly syntax; valuable
Function Overloading             Yes       Used mostly for constructors
Inlined Functions                Yes       Very nice
Calling C from C++               Yes       Strange syntax; nice feature
Comments                         Yes       Why?
set_new_handler()                Yes       Very convenient
Private or Protected Fields      Yes       Did not use effectively
Private or Protected Functions   Yes       Did not use effectively
Operator Overloading             Yes       Used only for new and delete
Calling C++ from C               Yes       Very awkward
Multiple Inheritance             No        Too complicated
Virtual Base Classes             No        Too complicated
Reference Parameters             No        Old dogs can’t learn new tricks
Friends                          No        Indicative of problems
I/O Streams                      No        Saw no advantage over stdio
Default Arguments                No        Used function overloading instead
Local Declarations Anywhere      No        Why?
Global Objects                   No        Did not need them
Copy Constructors                No        Used a pointer-based model
Static Member Fields             No        Provides tighter scoping
Static Member Functions          No        Provides tighter scoping
Const                            No        Good for public APIs


Key Management in an Encrypting File System


Matt Blaze 
AT&T Bell Laboratories 


Abstract 


As distributed computing systems grow in size, 
complexity and variety of application, the problem of 
protecting sensitive data from unauthorized disclo- 
sure and tampering becomes increasingly important. 
Cryptographic techniques can play an important role 
in protecting communication links and file data, since 
access to data can be limited to those who hold the 
proper key. In the case of file data, however, the rou- 
tine use of encryption facilities often places the orga- 
nizational requirements of information security in 
opposition to those of information management. 
Since strong encryption implies that only the holders 
of the cryptographic key have access to the cleartext 
data, an organization may be denied the use of its 
own critical business records if the key used to 
encrypt these records becomes unavailable (e.g., 
through the accidental death of the key holder). 


This paper describes a system, based on cryp- 
tographic "smartcards," for the temporary "escrow" 
of file encryption keys for critical files in a crypto- 
graphic file system. Unlike conventional escrow 
schemes, this system is bilaterally auditable, in that 
the holder of an escrowed key can verify that, in fact, 
he or she holds the key to a particular directory and 
the owner of the key can verify, when the escrow 
period is ended, that the escrow agent has neither 
used the key nor can use it in the future. We describe 
a new algorithm, based on the DES cipher, for the on- 
line encryption of file data in a secure and efficient 
manner that is suitable for use in a smartcard. 


1. Introduction 


Modern distributed computing systems, for all their
virtues, make it difficult to reliably limit access to
sensitive data. Networks often unselectively broadcast
data to far-reaching and unpredictable places,
remote login facilities create new opportunities for
trespassers and distributed file systems often assume 
that all machines to which they provide service are 
trustworthy and reliable. To reduce these risks, cryp- 
tographic techniques make it possible to limit data 
access while still taking advantage of untrustworthy 
networks and services. Modern workstations can 
encrypt in software at close to network speeds [4][5]. 
Data encryption attempts to ensure that only those 
who possess the correct decryption key can obtain the 
cleartext data. 


Most commercial applications of encryption 
techniques protect communication links (and related 
services such as electronic mail). When communica- 
tion endpoints are under the control of a single entity, 
or trust a common authority, the management of 
cryptographic keys is a conceptually straightforward 
matter. Keys can be assigned and changed as often as 
desired, the main problem being to ensure that both 
sender and receiver agree as to the current keys and 
that keys are discarded when no longer in use. 
Should sender and receiver get "out of sync" with the 
keys, the problem becomes immediately apparent 
because communication fails. Ensuring access by 
third parties in the event that keys are lost or unavail- 
able is rarely an issue.* Public key techniques 
[3][10] make communication key management easier, 
allowing two parties to establish a secure channel 
without prior arrangement. 


*The law enforcement community argues that it may be an 
exception; widespread use of encryption techniques may impede 
police wiretap investigations [2]. The ethical, legal, social and 
technical implications of law enforcement access to cryptographic 
communication are presently the subjects of intense public debate 
in the United States and are (fortunately) outside the scope of this
paper.
Cryptography can also be used to protect file
data, although there are relatively few tools for this 
purpose in widespread use. Most file encryption 
takes place at the application level, with tools such as 
the Unix crypt command or with special encrypting 
applications (e.g., "vi -x"). File encryption can also 
take place at a lower level, as a basic service of the 
file system [1][9][13]. 


Regardless of where encryption takes place, 
key management for encrypted files is a fundamen- 
tally different problem from that in cryptographic 
communication. In a secure communication system, 
keys must be distributed and synchronized geograph- 
ically. Keys often serve the dual purpose of authenti- 
cating identity as well as protecting against eaves- 
droppers. The architecture for distributing communi- 
cation keys is closely tied to the trust relationships 
within the system, and practical key distribution pro- 
tocols (such as those employed by the Kerberos sys- 
tem[12]) must be carefully engineered to balance reli- 
ability, security and performance. 


In a file system, on the other hand, there is usu- 
ally little need to distribute keys geographically; most 
protected files are encrypted and decrypted at the 
same locations (and by the same users). Authentica- 
tion of identity is a less serious issue, with access 
implicitly controlled through knowledge of the key 
itself, although cryptographic techniques can also be 
used to detect unauthorized tampering with file data. 
File systems still present a significant, if differently 
formulated, key management problem, however, in 
that keys can be said to be distributed temporally. 
The corresponding keys must be available at both 
encryption and decryption time. File encryption keys 
have much longer lifetimes than their communication 
counterparts. If a key is lost or unavailable, the files 
encrypted with it are rendered useless. This condi- 
tion may not be detected until it is too late. The key 
distribution center and public key cryptographic pro- 
tocols developed for geographically distributed com- 
munication systems do not have direct analogues that 
can be readily applied to temporal file key manage- 
ment. 


Arguably, it is because of difficulties associated 
with key management that sensitive files are rarely 
encrypted in practice even when encryption tools are 
available. This is especially true in critical business 
environments where ensuring the availability of data 
to authorized users is at least as important as ensuring 
its unavailability to everyone else. Sometimes, files 
are protected with weak ciphers, such that the 
encrypted data can be recovered with the application 
of sufficient computing resources. A toolkit ("Crypt
Breaker’s Workbench") is available in Internet
archives for the purpose of decrypting files enciphered
with the Unix crypt program. Needless to
say, since these tools are also available to the adver- 
sary, encryption with weak ciphers is of questionable 
value in the first place. 


In the context of organizational information 
systems, cryptographic file protection presents sev- 
eral problems not addressed by traditional (communi- 
cation-oriented) key management schemes. These 
problems are not only technical (e.g., providing 
mechanisms for ensuring that keys are available 
when and where authorized) but also managerial and 
social (balancing secrecy and privacy against emer- 
gency access requirements). Carefully controlled key 
management services with explicit, auditable trust 
relationships that are integrated into the underlying 
file system security architecture can help reconcile 
these often conflicting goals. 


2. Key Escrow 


Hence the problem: strong file encryption is often 
necessary to protect privacy while availability 
requirements sometimes dictate the need for a "back 
door" for emergency access. We use as our model 
the common problem of ensuring continued access to 
critical business files even after the only employees 
who know the keys to those files leave the organiza- 
tion. One approach adapts the procedures used for 
controlling physical locks and keys to file encryption 
keys and provides a central key distribution ("lock- 
smith") service. Any time a user requires an encryp- 
tion key, it is generated by a central service, which 
also keeps a copy for emergency access. 


In practice, however, the central locksmith 
model adapts poorly to large-scale file encryption key 
management. The central service must be uncondi- 
tionally trusted by all who obtain keys from it. No 
further controls preclude or audit access by those 
with access to the key database. (Note that this is not 
the case with locksmiths who manage physical keys 
— use of a key requires access to the lock, which 
may itself be controlled by independent security 
mechanisms and which can be changed if the lock- 
smith’s office is compromised. In the case of file 
keys, on the other hand, once a copy of the key 
database has leaked, all files with keys in the 
database must be considered compromised forever.) 
Furthermore, a central service can quickly become a
service bottleneck or, worse, a single point of failure
or attack. The key service is an "online" part of the
key creation process and users cannot create new
keys if the service is unavailable. Finally, the problem
of securing communication between the user and
the key center introduces all the problems of commu- 
nication key management in addition to the existing 
problem of file key management. 


An alternative approach reverses the relation- 
ship and provides a controlled mechanism for users 
to deposit copies of their keys for emergency use as 
needed. The keys for crucial files could thereby be 
"escrowed" with a trusted caretaker who would 
reveal them only when certain conditions are met, 
such as when encrypted business data are required 
after the death of the legitimate key holder. Concep- 
tually, keys might be delivered within sealed 
"envelopes." When a set of files is no longer critical, 
the envelope containing its keys could be returned to 
its originator, who could verify the integrity of the 
seal and destroy the keys, preventing future access to 
outdated, but still private, data. The "escrow-deposit" 
approach has the benefit of allowing the key holder to 
generate keys in the usual manner, without direct 
"online" interaction with a third party. There is no 
central service bottleneck, since the escrow agent is 
not directly involved in the creation of new keys. 
Envelopes containing escrowed keys can be delivered 
to the escrow agent at any time and any inability to 
deliver the keys to the agent need not preclude their 
use by the key holder. 


Unfortunately, this is difficult to do in practice. 
The simplest procedure has the key holder write 
down the key, place it in a sealed envelope, and leave 
it with a trusted caretaker. This is vulnerable to mis- 
takes, however, since there is no inherent mechanism 
to ensure that the escrowed key is the same as the real 
one. The security of the scheme also depends 
entirely on the honesty of the caretaker and the 
tamper-resistance of the envelope. An electronic ana- 
logue to the sealed envelope can be implemented by 
encrypting the key with a "caretaker" key, perhaps 
using public key techniques. If this is done automati- 
cally as part of key generation, the problems associ- 
ated with transcription mistakes are avoided, but the 
scheme still depends entirely on the caretaker’s hon- 
esty (and even more so without the sealed envelope). 
If no single caretaker can be trusted, the key could be 
multiply encrypted with more than one caretaker’s 
key, split among several escrow agents (in the man- 
ner of the US Escrowed Encryption Standard) or 
encrypted using a group-oriented public key protocol. 
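
As an illustration of splitting a key among escrow agents, the
following is a minimal sketch of the simplest two-way split, in which
one share is a random pad and the other is the key XORed with that
pad. It is not the procedure of any particular standard. The names
Key, split_key and join_shares and the 8-byte key size are assumptions
made for the example, and std::random_device merely stands in for a
proper cryptographic random number source.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>

// 8 bytes chosen for illustration.
typedef std::array<std::uint8_t, 8> Key;

// Split a key into two shares, one for each escrow agent. XORing both
// shares recovers the key; with a good random source, neither share
// alone reveals anything about it.
// Assumption: std::random_device stands in for a real cryptographic
// random number generator.
std::pair<Key, Key> split_key(const Key& key)
{
    std::random_device rng;
    Key share_a;
    Key share_b;
    for (std::size_t i = 0; i < key.size(); ++i) {
        share_a[i] = static_cast<std::uint8_t>(rng() & 0xFF);
        share_b[i] = static_cast<std::uint8_t>(key[i] ^ share_a[i]);
    }
    return std::make_pair(share_a, share_b);
}

// Recombine the two shares; both agents must cooperate.
Key join_shares(const Key& share_a, const Key& share_b)
{
    Key key;
    for (std::size_t i = 0; i < key.size(); ++i)
        key[i] = static_cast<std::uint8_t>(share_a[i] ^ share_b[i]);
    return key;
}
```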


Both the manual and encrypted key escrow 
schemes suffer from a fundamental problem, how- 
ever. After an escrow agent "opens" the key and 
learns its value, no further controls on its use are 
possible. Anyone who learns the keys can use them
at any time in the future without detection. Elec- 
tronic escrow is particularly hard to revoke or audit, 
since it is difficult to ensure that all copies of the keys 
have been destroyed when the escrow period ends 
even if the keys have never actually been used (con- 
sider backups or illicit copies of the escrow data). 


Under these schemes, key escrow is an "all or
nothing" proposition, with no mechanism to guarantee,
in any formal sense, that the caretaker is doing
his or her job honestly. It is not obvious how to 
implement key escrow schemes that offer stronger 
protection against abuse without relying on elaborate 
physical access controls or special purpose hardware. 


Cryptographic smartcards can be used to 
implement more carefully controlled and fully revo- 
cable file system key escrow. Smartcards have sev- 
eral properties that lend themselves to use as a con- 
trolled store for escrowed keys. These cards are 
designed to be sufficiently tamper-resistant to allow 
their use in financial applications, have a controlled- 
access non-volatile memory, can run general purpose 
software and include built-in cryptographic and ran- 
dom number generation capabilities. 


3. Smartcard-Based Key Escrow in a Crypto- 
graphic File System 


The shortcomings of entirely software-based key 
escrow schemes arise out of the inability to control 
the use of the key once it has been revealed to the 
escrow agent. Thus the problem is to guarantee the 
escrow agent use of the key without actually reveal- 
ing what it is. While this may appear to involve 
impossibly contradictory requirements, most com- 
mercial smartcards can be adapted to serve exactly 
this purpose. 


We propose a system in which an "escrow 
smartcard" can be created along with each file 
encryption key. This card is provided to a designated 
third party (the "escrow agent") who is authorized to 
use the key under some well-defined set of circum- 
stances. If emergency access is required the card can 
decrypt files without revealing what the key is, acting 
as a self-contained decryption engine for ciphertext 
sent to it by the escrow agent. Any time the card 
decrypts data it also records that fact in its secure 
storage. Later, when the escrow period is terminated 
or when an audit is to be performed, the user can 
query the card to determine whether the escrow agent 
has used it. This section describes the design and 
implementation of a smartcard-based key escrow 
scheme for CFS, a file encryption system for Unix. 
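
The interface idea, a device that will decrypt on request but never
discloses the key and remembers every use, can be modelled in a few
lines. This is only a sketch of that idea and not the protocol
implemented on the card. The class EscrowCard and its member names are
hypothetical, and a repeating-key XOR stands in for the card's real
cipher purely to keep the example short; it offers no security.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

// Hypothetical model of an escrow card. The key is private: callers
// can ask for decryption but cannot read the key, and every use is
// counted.
// Assumption: repeating-key XOR is a placeholder cipher, not the
// cipher used by the real system.
class EscrowCard {
public:
    explicit EscrowCard(const Bytes& key) : key_(key), uses_(0)
    {
        if (key_.empty())
            throw std::invalid_argument("key must not be empty");
    }

    // Used by the escrow agent in an emergency.
    Bytes decrypt(const Bytes& ciphertext)
    {
        ++uses_;  // record the access before returning any cleartext
        Bytes cleartext(ciphertext.size());
        for (std::size_t i = 0; i < ciphertext.size(); ++i)
            cleartext[i] = static_cast<std::uint8_t>(
                ciphertext[i] ^ key_[i % key_.size()]);
        return cleartext;
    }

    // Used by the owner at audit time or when the escrow period ends.
    unsigned long times_used() const { return uses_; }

private:
    Bytes key_;
    unsigned long uses_;
};
```

An owner who finds times_used() equal to zero when the card is
returned knows, to the extent the card is tamper-resistant, that the
escrow agent never decrypted anything with it.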
CFS is a cryptographic file system interface for
Unix-like systems; it allows the user to associate 
cryptographic keys with directories. It runs entirely 
on the client workstation. No modification to the 
underlying file system (or file server) is required, and 
file contents as well as some meta-data (file names) 
are cryptographically protected. Backups and other 
such routine administrative services can take place in 
the normal manner and without the encryption keys. 
Details on CFS can be found in [1]. 


Basically, CFS provides a mechanism to asso- 
ciate "real" directories (on other file systems) that 
contain encrypted data with temporary "virtual" 
names through which users can read and write cleart- 
ext. These virtual names appear in a separate names- 
pace under the CFS mount point, which is usually 
called /crypt. Users create encrypted directories 
on regular file systems (e.g., in their home directo- 
ries) using the cmkdir command, which creates the 
directory and assigns to it a cryptographic 
“passphrase” that will be used to encrypt its contents. 
To use an encrypted directory, it must be "attached" 
to CFS using the cattach command, which asks for 
the passphrase and installs an association between the 
"real" directory and a name under /crypt. Cleart- 
ext is read and written under the virtual directory in 
/crypt, but the files are stored in encrypted form 
(with encrypted names) in the real directory. When 
the directory is not in use, the association is removed 
with the cdetach command, which deletes the cleart- 
ext virtual directory under /crypt. When CFS is 
run on a client workstation, the cleartext data (and the 
cryptographic key passphrase) are never stored on a 
disk or sent over a network, even when the real direc- 
tory is located on a remote file server. The system is 
implemented as a user-level NFS[11] server. The 
basic flow of data in CFS is shown in Figure 1. 


Key escrow is implemented for CFS as an 
option to escrow the key when the encrypted direc- 
tory is created with cmkdir. When keys are initially 
assigned and whenever escrowed access is required, 
the machine running CFS must have a smartcard 
reader-writer attached. (In day-to-day user operation 
on encrypted files, no smartcard reader is required.) 
The smartcard has a small store of secure memory, 
the ability to run simple programs securely and a 
secret-key cryptographic engine compatible with that 
of the host file system. Ideally, the card could have a 
real-time calendar and the ability to schedule execu- 
tion at some future date, although the cards we use 
(the AT&T smartcard) do not have these capabilities. 
We call the user who created the files the "owner" 
and the caretakers of the escrowed keys the "escrow 
agents." The techniques described here could be 
applied to any file encryption system and are not lim- 
ited to CFS. 


At the time keys are assigned (e.g., with the 
CFS cmkdir command), the smartcard is initialized 
with three sets of cryptographic keys. The first key 
set, the "file system key," is used for actual file data 
encryption, and consists, in CFS, of two 56 bit 
DES[6] keys derived from a user-selected secret 
"passphrase." The file key is also used to hash a 
known plaintext string that is stored in the host file 
system in the "check file." The second key, the "audit 
key," is used to post-audit the card at escrow revoca- 
tion time and will be explained in more detail below. 
The audit key is also stored in a file on the host com- 
puter (encrypted under the file keys). The last key, 
the "escrow key," is used to encrypt the file system 
keys stored on the card. It must also be provided to 
the escrow agent (perhaps via public key techniques, 
and perhaps split among several agents, but this key 
is not essential to the security of the protocol). Ordi- 
narily, the escrow key is derived from a second 
passphrase entered by the owner. The encrypted file 
keys and audit key are maintained in secure storage 
on the card and cannot be easily "reverse engineered" 
from the card. All smartcard initialization takes place 
in CFS through a modified version of the cmkdir 
command. 


Once keys are assigned, the smartcard is turned 
over to an escrow agent for safekeeping and the 
escrow key passphrase revealed to the escrow agent. 
(The escrow agent who holds the card need not be the 
same agent who knows the escrow key). If the smart- 
card has a calendar and the ability to schedule future 
execution, the escrow data on the cards could be con- 
figured to automatically self-destruct after a set 
period. If needed, duplicate cards, with new escrow 
and audit keys, can be created by the owner (using 
the file passphrase) at any time. 


In normal CFS operation, the file system keys 
are derived from the user passphrase on the trusted 
host computer when the owner issues the "cattach" 
command for an encrypted directory; the smartcard is 
not involved. Regular user operation requires only 
the standard version of CFS (without any escrow 
software). The check file assures that the entered 
phrase is valid and that wildly incorrect decrypted file 
names and contents are not returned to the file sys- 
tem. 


The smartcard itself is used to perform three 
operations. The first operation, "pre-audit," simply 
verifies to the escrow agent that the keys on the card 
correspond to those used to encrypt the actual file 
system. In this mode of operation, the escrow agent 
sends the contents of the check file (in the escrowed 
file system) and the escrow key to the smartcard, 
which provides a "yes" or "no" answer based on the 
decrypted file keys. (The owner could "cheat" and 
provide a "dummy" check file; we discuss this 
below.) The escrowed keys do not leave the card. 


In "escrow access" operation, the smartcard 
decrypts files for the escrow agents. The agents sup- 
ply the escrow key; if it is supplied correctly, the card 
decrypts the file system keys and increments a 
counter in its secure store. Thereafter, for the remain- 
der of the session, the card will use the decrypted file 
keys to decrypt file data sent to it. If the card has a 
real time clock, it could also maintain two time 
stamps for the first and most recent times the escrow 
key was used. Again, the keys never leave the card; 
the card acts as a wholly self-contained decryption 
engine. Once the card is removed, its state is reset 
and the escrow key must be supplied again to enable 
further decryption. Escrow access in CFS takes place 
through a modified CFS file system daemon in which 
the crypto engine is replaced with calls to the smart- 
card interface. Additional support tools supply the 
escrow key to the smartcard. Note that the card inter- 
face is part of the data path for all decrypted data. 
The data flow is shown in Figure 2. 


The last mode of operation, "post-audit," is 
used when the escrow period is ended and the card is 
returned to the owner. The card reports the number 
of times the escrow keys were used. If the card has 
the capability to store this data it could also report the 
first and last access times and number of bytes 
decrypted under escrow (again, our cards do not). To 
help protect against card forgery and to safeguard 
against the return of a fake card by the escrow agent, 
the owner can challenge the card to perform encryp- 
tions under the audit key. The audit key is decrypted 
on the host computer with the owner passphrase; by 
comparing the results of a random challenge with the 
result of a decryption performed locally, the owner 
can verify that the card that was returned is the same 
one that was originally escrowed. Post-audit is per- 
formed in CFS with an additional user tool. 
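
The three card operations described above (pre-audit, escrow access, post-audit) can be pictured as a small state machine. The following is only a sketch under stated assumptions, not the AT&T card firmware: SHA-256 and HMAC stand in for the DES-based primitives, XOR stands in for "encrypted under the escrow key", and the class name `EscrowCard`, the stored escrow-key digest, and the string `CHECK_PLAINTEXT` are all hypothetical.

```python
import hashlib
import hmac
import secrets

CHECK_PLAINTEXT = b"CFS check string"  # hypothetical known plaintext


def kdf(passphrase: str) -> bytes:
    """Toy passphrase-to-key derivation (stand-in for the DES key derivation)."""
    return hashlib.sha256(passphrase.encode()).digest()


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def make_check(file_key: bytes) -> bytes:
    """Check-file contents: a keyed hash of a known string."""
    return hmac.new(file_key, CHECK_PLAINTEXT, hashlib.sha256).digest()


def toy_crypt(file_key: bytes, offset: int, data: bytes) -> bytes:
    """Toy XOR cipher for blocks of up to 32 bytes; encrypts and decrypts."""
    pad = hashlib.sha256(file_key + offset.to_bytes(8, "big")).digest()
    return xor(data, pad)


class EscrowCard:
    def __init__(self, file_key: bytes, audit_key: bytes, escrow_key: bytes):
        # The file key is held only in wrapped form (toy XOR wrapping).
        self._wrapped = xor(file_key, escrow_key)
        # Assumption: the card keeps a digest to recognise the escrow key.
        self._escrow_tag = hashlib.sha256(escrow_key).digest()
        self._audit_key = audit_key
        self._uses = 0            # incremented on every escrow access
        self._session_key = None  # cleared when the card is removed

    def pre_audit(self, check_file: bytes, escrow_key: bytes) -> bool:
        """Answer yes/no: do the escrowed keys match the file system?"""
        candidate = xor(self._wrapped, escrow_key)
        return hmac.compare_digest(make_check(candidate), check_file)

    def begin_escrow_access(self, escrow_key: bytes) -> bool:
        tag = hashlib.sha256(escrow_key).digest()
        if not hmac.compare_digest(tag, self._escrow_tag):
            return False
        self._session_key = xor(self._wrapped, escrow_key)
        self._uses += 1
        return True

    def decrypt(self, offset: int, block: bytes) -> bytes:
        """Decrypt on the card; the file key itself is never returned."""
        if self._session_key is None:
            raise PermissionError("escrow key not supplied in this session")
        return toy_crypt(self._session_key, offset, block)

    def remove(self) -> None:
        """Removing the card resets its session state."""
        self._session_key = None

    def post_audit(self, challenge: bytes):
        """Report the use count and answer a challenge under the audit key."""
        proof = hmac.new(self._audit_key, challenge, hashlib.sha256).digest()
        return self._uses, proof
```

The owner, who can recover the audit key on the host, recomputes the challenge response locally and compares it with the card's answer; a forged card that lacks the audit key cannot produce a matching response.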


3.1. File Encryption Scheme 


One of the lessons learned from the design of CFS is 
that the problem of encrypting files on-line in a file 
system is somewhat different from other kinds of 
encryption problems. No single standard encryption 
mode[7] has all the properties required for file system 
use; further compounding the problem are concerns 
that the 56 bits of key used by the DES cipher are 
vulnerable to exhaustive search attack[14]. 


Figure 2: Data Flow in CFS Escrow Agent System 


CFS uses a combination of DES "codebook" 
and pre-computable "stream" cipher modes to 
approximate the strength of multiple iterations of 
DES with the runtime latency of only a single itera- 
tion of DES. This scheme has the resistance to struc- 
tural analysis of a chaining cipher but allows random 
read and write access in constant time. 


The encryption scheme relies on the ability to 
trade off space (in the precomputation of the streams) 
for time. To accommodate the key escrow system, 
we modified the CFS encryption scheme to allow 
"lazy evaluation" on the smartcard without the large 
memory requirements of the precomputed stream. 
We believe this scheme to be equivalent to 3-DES 
under currently known practical attacks. The new 
CFS cipher is as follows: 


Recall that keys in CFS consist of two DES 
keys, $K_1$ and $K_2$, derived from the user passphrase. 
Conceptually, CFS file block encryption consists of 
encryption against a positionally defined stream 
cipher derived from $K_1$, which is then encrypted with 
a codebook block cipher under $K_2$, which is further 
encrypted with a second multi-use stream cipher 
derived from $K_1$. Specifically, 


$$E_p = \mathrm{DES}^{1}(K_2,\; D_p \oplus \mathrm{DES}^{1}(K_1, f(p \bmod m)) \oplus i) \oplus \mathrm{DES}^{1}(K_1, g(p \bmod m)) \tag{1}$$


The cipher is reversed in the obvious manner: 


$$D_p = \mathrm{DES}^{-1}(K_2,\; E_p \oplus \mathrm{DES}^{1}(K_1, g(p \bmod m))) \oplus \mathrm{DES}^{1}(K_1, f(p \bmod m)) \oplus i \tag{2}$$


where: 

$E_p$ is the ciphertext block of a file at byte offset p. 

$D_p$ is the cleartext block of a file at byte offset p. 


$\oplus$ is the bitwise exclusive-or operation. 


$\mathrm{DES}^{1}(k, b)$ 
is the Data Encryption Standard block encryp- 
tion function on cleartext b with key k. 


$\mathrm{DES}^{-1}(k, b)$ 
is the Data Encryption Standard block decryp- 
tion function on ciphertext b with key k. 


i is a bit representation of a unique file identifier 
derived from the Unix inode number and cre- 
ation time of the encrypted file. 

f(n), g(n) 
are publically known functions that map an 
integer representation n into unique bit strings 
of the DES codebook size (64 bits). 


m is the length of the precomputed stored stream 
(presently 256K bytes). 
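
As a quick check that equation (2) really inverts equation (1), write $S_f = \mathrm{DES}^{1}(K_1, f(p \bmod m))$ and $S_g = \mathrm{DES}^{1}(K_1, g(p \bmod m))$. This shorthand is introduced here for brevity and is not part of the original notation. Starting from (1):

$$
\begin{aligned}
E_p \oplus S_g &= \mathrm{DES}^{1}(K_2,\; D_p \oplus S_f \oplus i) \\
\mathrm{DES}^{-1}(K_2,\; E_p \oplus S_g) &= D_p \oplus S_f \oplus i \\
\mathrm{DES}^{-1}(K_2,\; E_p \oplus S_g) \oplus S_f \oplus i &= D_p
\end{aligned}
$$

The first line cancels the outer stream using $x \oplus x = 0$, the second uses the fact that $\mathrm{DES}^{-1}$ undoes $\mathrm{DES}^{1}$ under the same key, and the third cancels the inner stream and the file identifier, which is exactly the right-hand side of (2).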


Observe that the stream ciphers defined by 
$\mathrm{DES}^{1}(K_1, f(p \bmod m))$ and $\mathrm{DES}^{1}(K_1, g(p \bmod m))$ 
can be precomputed for each $K_1$ given 2m bytes of 
storage. The CFS daemon precomputes these 
streams when the cattach command is issued for a 
particular key. With the streams precomputed, each 
block encryption requires only one online DES oper- 
ation (the codebook cipher based on $K_2$). 


When decryption is performed on the card, the 
streams cannot be wholly precomputed in the card’s 
small local memory. Instead, the card calculates 
$\mathrm{DES}^{1}(K_1, f(p \bmod m))$ and $\mathrm{DES}^{1}(K_1, g(p \bmod m))$ 
for each cipherblock sent to it. ($f(p \bmod m)$ and 
$g(p \bmod m)$ are sent to the card from the host com- 
puter as parameters with the cipher block.) Although 
this is computationally slower than the precomputed 
cipher, requiring three DES operations per block 
instead of one, bandwidth to the card interface (a 
serial link) remains the primary limitation on encryp- 
tion speed. 
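
The difference between the host's precomputed streams and the card's lazy evaluation can be illustrated with a short sketch. It is not the CFS implementation: a toy 4-round Feistel cipher built on SHA-256 stands in for DES (which is not in the Python standard library), the stream length `M` is shrunk to 64 bytes, and the particular `f` and `g` below are hypothetical choices for the publicly known mapping functions.

```python
import hashlib

BLOCK = 8   # bytes: the 64-bit DES codebook size
M = 64      # toy stream length in bytes (the paper's m is 256K)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _round(key: bytes, rnd: int, half: bytes) -> bytes:
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:BLOCK // 2]


def toy_enc(key: bytes, block: bytes) -> bytes:
    """Stand-in for DES^1: a 4-round Feistel network on one 8-byte block."""
    left, right = block[:4], block[4:]
    for rnd in range(4):
        left, right = right, xor(left, _round(key, rnd, right))
    return left + right


def toy_dec(key: bytes, block: bytes) -> bytes:
    """Stand-in for DES^-1: the same rounds run in reverse."""
    left, right = block[:4], block[4:]
    for rnd in reversed(range(4)):
        left, right = xor(right, _round(key, rnd, left)), left
    return left + right


def f(n: int) -> bytes:           # hypothetical public mapping
    return n.to_bytes(BLOCK, "big")


def g(n: int) -> bytes:           # hypothetical; differs from f in the top bit
    return (n | (1 << 63)).to_bytes(BLOCK, "big")


def precompute(k1: bytes):
    """Host side: build both streams once per key (2 * M bytes in total)."""
    sf = {q: toy_enc(k1, f(q)) for q in range(0, M, BLOCK)}
    sg = {q: toy_enc(k1, g(q)) for q in range(0, M, BLOCK)}
    return sf, sg


def encrypt_block(k1, k2, d, p, file_id, streams=None):
    """Equation (1); p is a block-aligned byte offset."""
    q = p % M
    sf = streams[0][q] if streams else toy_enc(k1, f(q))
    sg = streams[1][q] if streams else toy_enc(k1, g(q))
    return xor(toy_enc(k2, xor(xor(d, sf), file_id)), sg)


def decrypt_block(k1, k2, e, p, file_id, streams=None):
    """Equation (2); with streams=None this is the card's lazy evaluation."""
    q = p % M
    sf = streams[0][q] if streams else toy_enc(k1, f(q))
    sg = streams[1][q] if streams else toy_enc(k1, g(q))
    return xor(xor(toy_dec(k2, xor(e, sg)), sf), file_id)
```

With `streams` supplied, each block costs one cipher call under the second key; without it, each block costs three calls, matching the trade-off described above.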


4. Practical Applications 


File system key escrow can support a variety of 
application domains. Ensuring organizational access 
to proprietary data was discussed and motivated 
above. Here, an employee has primary operational 
responsibility for data that belongs to an organization. 
Key escrow allows the organization to provide other 
individuals with emergency access capability in the 
primary employee’s absence. Access by these 
"backup" individuals can be granted, controlled, 
audited and revoked easily, without compromising 
the organization’s ability to maintain and control its 
own information. 


Smartcard-based escrow also facilitates other 
backup access relationships. In the organizational 
scenario above, the primary key holder is "subordi- 
nate" to the escrow holder. Alternatively, a manager 
may be the primary key holder for sensitive-but- 
critical business data for which the keys are escrowed 
with an employee. The escrow key holder may not 
be authorized for routine access, but in the manager’s 
absence may be required to perform "proxy" func- 
tions on the manager’s behalf. Here, the smartcard 
system implements and enforces a common business 
delegation of authority practice. 


Another scenario, which may become more 
important in the future, involves the protection of 
individual personal records. Consider, for example, a 
system in which medical records are encrypted under 
a key known only to the patient. Routine use of these 
records by a health practitioner requires the patient's 
active consent in supplying the key. In an emergency, 
however, access to the records may be required even 
when the patient is physically unable to supply the 
key. A key escrow smartcard, which might remain in 
the physical possession of the patient or be main- 
tained with the records themselves, would enable 
such emergency access but still permit the patient to 
control (and revoke) the routine use of his or her pri- 
vate records. The proposed US national health care 
insurance system includes a smartcard-based identifi- 
cation token into which such a scheme could possibly 
be integrated. 


4.1. Performance 


The standard CFS system employs a software-based 
cryptographic engine that performs encryption on a 
modern workstation at between one and three 
Mbps[4]. Because CFS uses the standard file system 
cache, actual performance is much better, with a per- 
formance penalty of only 20-50% above the underly- 
ing file system under typical workloads. The escrow 
access system, on the other hand, performs all crypto- 
graphic operations on the smartcard, which commu- 
nicates with the host workstation at serial link speeds 
(19,200 bps). After protocol and processing over- 
head, cryptographic bandwidth to the card is about 
6,000 bps with the CFS cipher described in the previ- 
ous section. Using the smartcard for decryption 
slows the cryptographic engine by almost three 
orders of magnitude. Cache performance hides this 
slightly, but the escrow access system is by no means 
transparent or fast enough for routine operational use. 
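
The slowdown can be checked with a little arithmetic. The sketch below assumes that "Mbps" means 10^6 bits per second and uses a hypothetical 1,000,000-byte file as the example workload; only the 6,000 bps and one-to-three Mbps figures come from the text.

```python
SOFTWARE_BPS = (1_000_000, 3_000_000)  # software engine, low and high estimates
CARD_BPS = 6_000                       # effective cryptographic bandwidth to the card


def slowdown(software_bps: float, card_bps: float = CARD_BPS) -> float:
    """How many times slower the card is than the software engine."""
    return software_bps / card_bps


def seconds_to_decrypt(num_bytes: int, bps: float) -> float:
    """Time to push num_bytes through a channel of the given bit rate."""
    return num_bytes * 8 / bps


if __name__ == "__main__":
    for rate in SOFTWARE_BPS:
        print(f"{rate} bps software engine: card is {slowdown(rate):.0f}x slower")
    minutes = seconds_to_decrypt(1_000_000, CARD_BPS) / 60
    print(f"1,000,000-byte file via the card: about {minutes:.0f} minutes")
```

The ratio works out to roughly 170 to 500, that is, between two and three orders of magnitude, and a file of that size takes on the order of twenty minutes to decrypt through the card.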


In practice, the reduced performance is rarely 
an issue, since escrow access is not intended to sup- 
port routine processing. (Write operations by the 
escrow agent are not even supported by our imple- 
mentation). The normal mode of escrow operation 
involves copying out those files required for emer- 
gency access, such that the card is not subsequently 
required for their use. 


These are not fundamental limitations. Faster 
smartcards are beginning to emerge in the market, 
along with faster interfaces with bandwidths that 
exceed the crypto-bandwidth of current software 
implementations. PCMCIA cards hold particular 
promise in this area. 


1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA 




4.2. Trust Model 


Smartcard-based key escrow does not absolutely 
guarantee that the access policy will be enforced. 
There are risks associated with various parts of the 
system, each of which must be assessed in light of the 
application’s security policy, threat model and avail- 
able alternatives. 


The system depends on the reverse engineering 
resistance of the escrow smartcard devices to control 
access by the escrow agents. Reverse engineering 
could reveal the keys stored on the card and permit 
the escrow agent to create duplicate cards without the 
knowledge of the key owner. Although the risk of 
reverse engineering is difficult to quantify as technol- 
ogy progresses, commercial smartcards are designed 
to resist this sort of attack. Recent trends in tamper- 
resistant packaging and chip fabrication technology 
suggest the emergence of future products with greatly 
reduced vulnerability to reverse engineering. In 
highly sensitive environments in which the integrity 
of the smartcard is not completely trusted, the card 
can be protected with augmented physical safeguards 
such as sealed envelopes and accountable paper audit 
trails. 


By definition, the escrow agent has access to 
the escrowed data while in possession of the escrow 
card. The only built-in control on the escrow agent is 
access detection when the card is eventually audited. 
If the card is not returned by the agent, however, it is 
not possible to audit past access or prevent future 
access as long as the encrypted data remains avail- 
able. The escrow key serves to limit unauthorized 
use of lost or stolen cards. If no single agent is 
trusted, possession of the card and the escrow key 
can be split among two or more agents. These risks 
are largely a function of the relationship between the 
escrow agent and the key owner. When appropriate, 
the owner can periodically audit the escrow card 
throughout the escrow period. Controls on access to 
the encrypted escrowed data can further ameliorate 
the risk of unauthorized access by the agent. 
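
The paper does not say how the escrow key would be divided among agents. One simple possibility, shown here purely as an assumption, is n-of-n XOR splitting: every share is needed to rebuild the key, and any smaller set of shares reveals nothing about it. The function names are hypothetical.

```python
import secrets
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split_secret(secret: bytes, n: int) -> list:
    """Split secret into n shares; all n are required to recover it."""
    if n < 2:
        raise ValueError("need at least two shares")
    # n - 1 shares are uniformly random ...
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    # ... and the last one is chosen so that all shares XOR to the secret.
    shares.append(reduce(xor, shares, secret))
    return shares


def combine(shares: list) -> bytes:
    """Recover the secret by XOR-ing every share together."""
    return reduce(xor, shares)
```

A threshold scheme (any k of n agents) would need a different construction; this sketch covers only the case in which all agents must cooperate.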


Any escrow system carries the risk of "cheat- 
ing" by a key owner who encrypts data with keys 
other than those escrowed. This risk is present any 
time the key owner is able to supply his or her own 
cryptographic system. The check file in the smart- 
card system guards only against mistakes, not against 
deliberate deception. All escrow systems suffer from 
this limitation. In a centralized key distribution sys- 
tem, nothing prevents the use of "out of band" keys 
not obtained from the key center. In a system such as 
the government Escrowed Encryption Standard[8] 
(the "Clipper chip"), it is possible to suppress the 
escrow exploitation field in the data stream or pre- 
encrypt with a secure non-escrowed cryptosystem. 
(The government system attempts to reduce this risk 
by supplying the escrowed devices in tamper- 
resistant modules, making it difficult to deploy the 
cipher without the escrow features.) 


The risk of end-user escrow circumvention 
depends on the relationship between the key owner 
and the escrow agent. If escrow is perceived as a ser- 
vice for the mutual benefit of the key owner and 
agent, this risk is not an issue. If, on the other hand, 
this relationship is adversarial, there can be no com- 
pletely reliable mechanism that prevents cheating. 


5. Conclusions 


Key escrow is not appropriate for all file encryption 
applications. Some data are simply too private; per- 
sonal diaries, certain individual medical and financial 
records and other data for which there is no motiva- 
tion for the data owner to allow third party access are 
poor candidates for escrow. Other data, such as day- 
to-day operational business records, have such high 
availability requirements as to preclude any encryption 
at all. Escrow serves the "middle ground" for which 
security requirements suggest the need for crypto- 
graphic protection while availability requirements 
dictate the need for access. 


Smartcard-based escrow overcomes the major 
shortcomings of software-based and manual escrow 
systems. Unlike manual systems, the escrowed keys 
can be reliably pre-audited to ensure their validity 
without compromising sensitive data. And unlike 
either system, once the card is returned, the owner is 
assured of whether the escrow process was used and 
that no further decryptions can occur. Escrowed 
decryption is completely under the control of the 
card; past possession of the card conveys no future 
privileges. 


7. Availability 


A research prototype of the base CFS system (imple- 
mented as a user-level NFS server) is available free 
upon request within the US and Canada. We regret 
that US Government-imposed export restrictions pre- 
vent us from making it available elsewhere. For 
information, ftp dist/mab/cfs.announce from 
research.att.com or send email to 
cfs@research.att.com. The smartcard soft- 
ware, including the escrow system described here, is 
not presently available. 


Abstract 


As the number of businesses and 
government agencies connecting to the Internet 
continues to increase, the demand for Internet 
firewalls — points of security guarding a private 
network from intrusion — has created a demand for 
reliable tools from which to build them. We present 
the TIS Internet Firewall Toolkit, which consists of 
software modules and configuration guidelines 
developed in the course of a broader ARPA- 
sponsored project. Components of the toolkit, while 
designed to work together, can be used in isolation 
or can be combined with other firewall components. 
The Firewall Toolkit software runs on UNIX® 
systems using TCP/IP with the Berkeley socket 
interface. We describe the Firewall Toolkit and the 
reasoning behind some of its design decisions, 
discuss some of the ways in which it may be 
configured, and conclude with some observations as 
to how it has served in practice. 


Overview 


Computer networks by their very nature 
are designed to allow the flow of information. 
Network technology is such that, today, you can sit 
at a workstation in Maryland, and have a process 
connected to a system in London, with files mounted 
from a system in California, and be able to do your 
work just as if all of the systems were in the same 
room as your computer. Impeding the free flow of 
data is contrary to the basic functionality of the 
network, but the free flow of information is contrary 
to the rules by which companies and governments 
need to conduct business. Proprietary information 
and sensitive data must be kept insulated from 
unauthorized access, yet security must have a 
minimal impact on the overall usability of the 
network. 


The purpose of an Internet firewall is to 
provide a point of defense and a controlled and 
audited access to services, both from within and 
without an organization’s private network. This 
requires a mechanism for selectively permitting or 
blocking traffic between the Internet and the 
network being protected.¹ Routers can control traffic 
at an IP level, by selectively permitting or denying 
traffic based on source/destination address or port. 
Hosts can control traffic at an application level, 
forcing traffic to move out of the protocol layer for 
more detailed examination. To implement a firewall 
that relies on routing and screening, one must permit 
at least a degree of direct IP-level traffic between the 
Internet and the protected network. Application level 
firewalls do not have this requirement, but are less 
flexible since they require development of 
specialized application forwarders known as 
“proxies.” This design decision sets the general 
stance of the firewall, favoring either a higher degree 
of service or a higher degree of isolation. [1] 


A proxy for a network protocol is an 
application that runs on a firewall host and connects 
specific service requests across the firewall, acting as 
a gateway. Figure 1 represents a minimal TELNET 
service proxy, in which the proxy forwards a user’s 
keystrokes to a remote system, and maintains audit 
records of connections. Proxies can give the illusion 
to the software on both sides of a direct point-to- 
point connection. Since many proxies interpret the 
protocol that they manage, additional access control 
and audit may be performed as desired. As an 
example, the FTP proxy can block FTP export of 
files while permitting import of files, representing a 
granularity of control that router-based firewalls 
cannot presently achieve. Router-based firewalls can 
provide higher throughput, since they operate at a 
protocol level, rather than an application level, but 
practical experience running firewalls on modern 
RISC processors shows that with a T-1 connection, 
the bottleneck tends to remain the T-1 link rather 
than the firewall itself. 


¹ Or, in general, between any two networks where 
one needs to be protected from the other. 


Figure 1: An Application Proxy 


Proxies exist for a wide variety of services, 
such as X, FTP, TELNET, etc. Perhaps the most 
significant security benefit of employing proxies is 
that they provide a convenient opportunity to require 
authentication. For example, when connecting into a 
protected network from the Internet, one must 
typically first connect to the proxy, authenticate to it, 
and then complete a connection to a host within the 
protected network. The proxy protects the firewall 
host itself, by eliminating the need for the user to log 
into the firewall itself, and it protects the network by 
permitting only authenticated users to gain access 
from the outside. While hosts on the private network 
may still be rife with security holes, restricting the 
incoming traffic to authenticated users only is a good 
step in the right direction. 


Other services, such as Internet (SMTP) 
mail and USENET news, act as store-and-forwarders 
already, and fit in with the proxy approach to 
firewalls. These service daemons sometimes run with 
system privileges and may contain bugs that an 
attacker can exploit. Many existing firewalls rely on 
approximate assessment of privileged systems 
software for their trustworthiness. This is sufficient 
if there are “well known working versions” of 
common programs such as the FTP server, ftpd. In 
some cases, however, the server can itself 
compromise security. A recent version of the 
WUArchive ftpd[2] contained a bug that permitted 
anyone on the Internet to gain super-user access to 
systems on which it was running. In our design, we 
attempt to sidestep the issue by providing proxies 
that can run locked into a specific subdirectory by 
means of “chroot”, a UNIX system call that 
permanently restricts the working filesystem of a 
process. Proxies are also designed to run without 
special system privileges, to further reduce the 
chance that they might be able to damage the system. 
Ideally it should be impossible for an outside user to 
ever interact with a privileged process. Practically 
speaking, the Internet service master daemon inetd, 
which is responsible for starting other service 
daemons, needs to run with privileges, but outside 
users cannot interact directly with it. There is a 
possibility that the kernel may have trapdoors or 
hidden network services built into it, but it is 
impractical to attempt to obtain and examine kernel 
sources for such flaws. Instead, one should make every effort to 
remove unnecessary kernel services at system build 
time. 
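
The confinement described above, locking a process into a subdirectory and then giving up system privileges, can be sketched as follows. This is an illustration in Python rather than the toolkit's own code, and the function name `confine` and its parameters are hypothetical. It must be started as root on a UNIX system, since only a privileged process may call chroot.

```python
import os


def confine(jail_dir: str, uid: int, gid: int) -> None:
    """Lock the calling process into jail_dir, then drop root privileges."""
    os.chroot(jail_dir)   # requires root; the file system root is now jail_dir
    os.chdir("/")         # make sure the working directory is inside the jail
    os.setgroups([])      # drop supplementary groups
    os.setgid(gid)        # change group first: after setuid this would be refused
    os.setuid(uid)        # finally give up root itself
    if os.getuid() == 0 or os.geteuid() == 0:
        raise RuntimeError("still running with root privileges")
```

The order matters: the process must enter the jail while it is still privileged, and must change its group before its user ID, because an unprivileged process can do neither.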


Design Philosophy 


The TIS Firewall Toolkit (hereafter 
referred to as “the toolkit”) is designed to be 
informally verified for correctness as a whole or at a 
component level. Since the firewall consists of 
discrete components, each providing a single service, 
each may be examined separately from the rest of the 
system. Components of the toolkit are as simple as 
possible in their implementation, and are distributed 
in source code form to encourage peer review. This 
appears to be a fairly novel approach for a network 
firewall, as many existing firewall systems rely on 
software that is “known to be good” or that is 
considered trustworthy because it has been used 
extensively for a long time. 


One problem with the “known to be good” 
approach is that historically it hasn’t been very 
reliable. Certain software components are frequently 
exploited in break-ins, no matter how carefully they 
are maintained. Problem programs are usually 
complex pieces of software, implemented in tens of 
thousands of lines of code, which require system 
privileges in order to operate. As a step towards 
addressing this, the firewall toolkit operates in 
accordance with the following general firewall 
design principles: 


• Even if there is a bug in the implementation of a 
network service, it should not be able to compromise 
the system. Services that are misconfigured should 
not work at all, rather than opening holes. 


• Hosts on the untrusted network should not be able 
to connect directly to network services that are 
running with privileges. 


• Network services are implemented with a 
minimum of features and complexity. The source 
code is simple enough to be reviewed thoroughly and 
quickly. 


• There should be reasonable and pragmatic means 
of testing that the system is correctly installed. 
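
The first principle, that a misconfigured service should refuse to work rather than open a hole, can be illustrated with a small access-rule checker. This is a sketch and not the toolkit's parser: the keywords `permit-hosts` and `deny-hosts` are borrowed from the toolkit's ftp-gw example, but the one-pattern-per-line syntax, the first-match-wins evaluation, and the names `parse_rules` and `allowed` are assumptions made for the example.

```python
from fnmatch import fnmatchcase


class ConfigError(Exception):
    """Raised for any rule the parser does not fully understand."""


def parse_rules(text: str) -> list:
    """Parse rules strictly: one unknown or malformed line rejects the file."""
    rules = []
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) != 2 or fields[0] not in ("permit-hosts", "deny-hosts"):
            raise ConfigError(f"line {lineno}: cannot parse {raw!r}")
        rules.append((fields[0] == "permit-hosts", fields[1]))
    return rules


def allowed(rules: list, addr: str) -> bool:
    """Read rules top to bottom; the first matching pattern decides."""
    for permit, pattern in rules:
        if fnmatchcase(addr, pattern):
            return permit
    return False  # no rule matched: deny by default
```

A typo such as `permit-host` raises `ConfigError` instead of being silently skipped, so the service does not start with a partial rule set, and an address that matches no rule is refused.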


The toolkit is designed to be used with a 
host-based security policy, but its components can be 
used with router-based firewalls. In this paper, we 
will focus on the former. In a host-based firewall, the 
security of the host is crucial; once it is compromised 
the entire network is open to attack. Still, we believe 
that a host-based firewall is superior to other 
solutions because of the ease with which it can be 
maintained, configured, customized and audited. 
When the toolkit is used with router-based firewalls, 
it is assumed that the toolkit software is running on a 
secure host that is permitted some degree of access 
between the protected network and the Internet, by 
means of routers. This leaves the option of 
configuring the routers to provide additional avenues 
between the protected network and the Internet for 
whatever reason; such additional avenues are outside 
the scope of the toolkit and should be provided only 
after careful security analysis. 


The toolkit may be used in conjunction 
with router-based screening as extra security. To 
minimize risks, the services that are provided on the 
external machine, which we will refer to as a 
“bastion host”, following the terminology proposed 
by Ranum[3], are sharply curtailed and each service 
is subjected to review. On the “standard” firewall 
configuration, the only services supported are 
SMTP, FTP, NNTP, and TELNET. Other proxies 
such as Digital Equipment Corporation’s X Window 
System proxy [4] can be added to this architecture. 


SMTP service is supported through a non- 
privileged front end that runs locked in a “safe 
directory” via chroot. FTP is supported via a proxy 
that runs without requiring special privileges. NNTP 
is supported via a “tunnel” server that permits traffic 
between a host on the inside and its news server on 
the outside. TELNET service is via a proxy that 
runs unprivileged. Since all other services on the 
system are disabled selectively, it is only these four 
services that must be analyzed for risk. By analyzing 
the security of each service in isolation, we are 
able to gain a degree of trust in the system beyond 
merely being able to state “Well, we don’t think there 
are any bugs.” With all the services running 
unprivileged we can make a stronger statement, to 
wit, “The security of an individual service is 
irrelevant to the overall security, as the server is 
running in a captive mode.” 


Configuration and Components 


Figure 2 represents the toolkit installed in 
an environment that combines routers and a firewall 
bastion host. The implementation of the security 
controls is shared (in this example) between the 
routers and the firewall: the routers are responsible 
for controlling network-level access, and the bastion 
host provides application-level control. A simpler 
firewall configuration would consist of a dual-homed 
gateway, in which a workstation with two network 
interfaces is connected to both networks, and has IP 
forwarding disabled. Dual homed gateways are less 
flexible than firewalls that combine routers and 
hosts, since the option to route services at a network 
level is generally not available.² On the other hand, 
with a dual-homed gateway, the administrator can 
have a higher degree of confidence that no network 
traffic will be able to somehow “leak” through a 
router, since routers are no longer an integral part of 
the security system. 





² Some versions of UNIX support packet screening 
within the operating system. 


Figure 2: A Screened Host Firewall 


The toolkit is designed to build a host- 
based firewall, with security being enforced by a 
single bastion host. For ease of management, all the 
proxies and access control tools use a single 
configuration file with a regular syntax. We thought 
this was useful due to the generally complex 
configuration of various publicly available firewall 
tools, of which no two are configured in the same 
way. The configuration rules are designed to provide 
both configuration and service and access 
permissions information, being read top-to-bottom 
and left-to-right. Hostnames or IP addresses 
including simple wildcards can be used in 
configuration rules, but IP addresses are preferred 
since DNS addresses are vulnerable to spoofing. 


# Example ftp gateway rules: 
ftp-gw: authserver      127.0.0.1 7777 
ftp-gw: denial-msg      /usr/local/etc/ftp-deny.txt 
ftp-gw: welcome-msg     /usr/local/etc/ftp-welcome.txt 
ftp-gw: help-msg        /usr/local/etc/ftp-help.txt 
ftp-gw: timeout         3600 
ftp-gw: permit-hosts    192.33.112.100 
ftp-gw: deny-hosts      128.52.46.* 
ftp-gw: permit-hosts    192.33.112.* -log { retr stor } -auth { stor } 
ftp-gw: permit-hosts    * -authall 


The firewall toolkit functionality can be 
broken down into six areas: logging, electronic mail, 
the Domain Name Service, FTP, TELNET, and TCP 
access control. 


Logging 


Significant security events and audit 
records are logged to a protected host on the internal 
network via the syslog facility. The version of 
syslogd that the toolkit uses is based on the BSD 
“net2” sources, with some modifications to support 
pattern-matching and program execution on matched 
patterns. Many systems administrators have batch 
processes set up on their systems to alert them of 
possible security problems by searching the system 
logs at regular intervals. By permitting the system 
manager to add regular expressions to the syslogd 
configuration, security-related log messages can be 
identified instantly. Syslogd contains further 
modifications that permit an arbitrary command to 
be invoked with any specified logging rule, so that, 





1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA 


for example, vitally important security log events can 
be delivered to the system manager’s beeper or 
delivered by electronic mail. Adding command 
execution to syslogd implies that the syslogd 
configuration file must be protected against 
unauthorized modification. 
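
As a rough illustration of the pattern-matching idea described above, the following C sketch checks a log line against a list of rules using the POSIX regex functions and returns the command associated with the first matching rule. The struct log_rule type, the match_log_rule() function, and the rule layout are hypothetical; they are not taken from the toolkit's modified syslogd, whose configuration syntax is not shown here.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical rule: a pattern plus the command to run when it matches. */
struct log_rule {
    const char *pattern;   /* POSIX extended regular expression */
    const char *command;   /* command to invoke on a match */
};

/*
 * Return the command of the first rule whose pattern matches the log
 * line, or NULL if no rule matches. Malformed patterns are skipped.
 */
const char *match_log_rule(const struct log_rule *rules, size_t nrules,
                           const char *line)
{
    assert(rules != NULL && line != NULL);
    for (size_t i = 0; i < nrules; i++) {
        regex_t re;
        if (regcomp(&re, rules[i].pattern, REG_EXTENDED | REG_NOSUB) != 0)
            continue;                      /* skip malformed patterns */
        int hit = (regexec(&re, line, 0, NULL, 0) == 0);
        regfree(&re);
        if (hit)
            return rules[i].command;
    }
    return NULL;
}
```

A caller would then hand the returned command to whatever mechanism it uses to start programs. Because that command comes from the configuration file, the sketch makes the point of the preceding paragraph concrete: whoever can edit the rules can run programs.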


Electronic Mail 


Mailers are one of the favorite points of 
attack against UNIX systems. The Morris Internet 
worm exploited a well-known hole in the standard 
UNIX SMTP server, sendmail. Many systems 
running sendmail, including those with Internet 
firewalls, were penetrated by the worm. A few that 
had replaced sendmail with other SMTP servers 
were not. Since that time, a variety of other security 
holes have been identified in sendmail and fixed in 
more recent releases. 


The problem with mailers is twofold: they 
are complex and perform file system activity, and 
they require privileges so that they can manipulate 
mailboxes or execute mail-processing programs on 
behalf of users. To help secure mail service, 
direct network access to sendmail is prevented. A 
simple program that implements a skeleton of the 
SMTP protocol is presented on the SMTP port on the 
mail server. This SMTP proxy, called smap, is small 
enough to be subjected to a code review for 
correctness (unlike sendmail) and simply accepts all 
incoming messages and writes them to disk in a 
spool area. Rather than running with system privileges, 
the proxy runs with a restricted set of permissions 
and runs “chrooted” to the spool area. A second 
process is responsible for scanning the spool area 
and delivering the mail messages to the real 
sendmail for delivery — a mode of operation in 
which sendmail can operate with reduced 
permission. Many Internet firewalls run sendmail 
and rely on “trustworthy” versions of the software; 
running the mail software in a reduced-permissions 
mode is a more general solution to the problem, side- 
stepping the issue of whether or not a given version 
of sendmail contains bugs. 
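
The reduced-permissions mode described above can be sketched in C as follows. This is a minimal sketch, assuming a Linux or BSD system and a process that starts as root; the enter_jail() name and its parameters are hypothetical and do not come from the smap source.

```c
#define _DEFAULT_SOURCE    /* glibc: expose chroot() and setgroups() */
#include <assert.h>
#include <grp.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Confine the calling process to `dir` and give up root privileges.
 * Returns 0 on success, -1 on the first failing step.
 * Must be called while still running as root, since chroot(2) requires it.
 */
int enter_jail(const char *dir, uid_t uid, gid_t gid)
{
    assert(dir != NULL);
    if (chdir(dir) != 0)          return -1;  /* move into the spool area */
    if (chroot(".") != 0)         return -1;  /* make it the new root */
    if (chdir("/") != 0)          return -1;  /* stay inside the new root */
    if (setgroups(1, &gid) != 0)  return -1;  /* drop supplementary groups */
    if (setgid(gid) != 0)         return -1;  /* group first, while still root */
    if (setuid(uid) != 0)         return -1;  /* then the user id */
    return 0;
}
```

The order matters: the group and user ids are changed only after chroot(), because an unprivileged process can no longer call chroot() or change its group list.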


While smap answers all valid SMTP 
commands sent to it, it does not execute any of them 
except those directly involved with mail exchange: 
HELO, FROM, RCPT, DATA, and QUIT. Other 
commands, such as VRFY and EXPN, return a polite 
error message. Smap preserves sendmail’s 
functionality, while preventing an arbitrary user on 
the network from communicating directly with it. 
Analyzing sendmail’s 20,000 lines of source code for 
bugs is a sizable task when compared to analyzing 
smap’s 700 lines. Smap is not a panacea, however, as 
firewalls remain vulnerable to data-driven attacks in 
which messages may be mailed to hosts on the 
private network, possibly triggering security holes in 
internal mailers. Since many of these attacks have a 
distinctive signature, smap or the firewall’s mailer 
can be configured to attempt to identify these letter- 
bombs, but the security administrator is forced into 
the unfortunate position of an arms-race in which a 
reactive role must be taken as new attacks are 
invented. To reduce the risk of attacks that exploit 
mailing through programs, the mailer on the firewall 
itself is configured so that program execution is 
disabled. Disabling program execution is often an 
unacceptable solution on a multi-user system, but 
since the firewall is not a general use host, we prefer 
to reduce the risk of someone being able to execute 
arbitrary commands from afar. 
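
The command filtering described in this section can be illustrated with a small C function that decides whether an incoming SMTP line carries one of the commands the proxy acts on. This is a sketch under stated assumptions: smtp_is_handled() is a hypothetical name, and what the text calls FROM is matched here by its SMTP verb, MAIL.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strncasecmp */

/*
 * Return 1 if `line` begins with a command involved in mail exchange,
 * 0 otherwise (the caller would then send a polite error reply).
 */
int smtp_is_handled(const char *line)
{
    static const char *const handled[] = {
        "HELO", "MAIL", "RCPT", "DATA", "QUIT"
    };
    assert(line != NULL);
    for (size_t i = 0; i < sizeof handled / sizeof handled[0]; i++) {
        if (strncasecmp(line, handled[i], 4) == 0) {
            char next = line[4];
            /* the verb must end here: "DATA" matches, "DATABASE" does not */
            if (next == '\0' || next == ' ' || next == '\r' || next == '\n')
                return 1;
        }
    }
    return 0;
}
```

Keeping the accepted set this small is what makes a line-by-line review practical: any verb not in the table, including VRFY and EXPN, falls through to the error path.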


Domain Name Service (DNS) 


The name service software available for 
UNIX implements an in-memory read-only database. 
As such, it cannot be used to gain unauthorized 
access to a system. Some past attacks on firewalls 
have used name service spoofing as a technique for 
impersonating trusted network hosts. In order to 
remove the threat of name service spoofing, the 
firewall does not rely on name service for any 
security-related information. The name server 
software is necessary for high-performance large-scale 
mail systems and is configured so that the only 
application that relies on name service for 
addressing is the electronic mail system. DNS names 
are also used in audit records, but are always 
presented along with host network addresses; 
mismatches are flagged as possible spoofing 
attempts. 
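
One way to detect the mismatches mentioned above is to resolve the claimed name forward and check that the peer's address is among the results. The C sketch below does this with getaddrinfo(); the function names are hypothetical, only IPv4 is considered, and nothing here is taken from the toolkit's own audit code.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Return 1 if `peer` appears in the address list `res`, else 0. */
int addr_in_list(const struct in_addr *peer, const struct addrinfo *res)
{
    for (; res != NULL; res = res->ai_next) {
        if (res->ai_family != AF_INET)
            continue;
        const struct sockaddr_in *sin =
            (const struct sockaddr_in *)res->ai_addr;
        if (sin->sin_addr.s_addr == peer->s_addr)
            return 1;
    }
    return 0;
}

/*
 * Forward-resolve `name` and report whether `peer` is one of its
 * addresses. Returns 1 on a match, 0 on a mismatch (a possible
 * spoofing attempt), and -1 if the name cannot be resolved.
 */
int name_matches_addr(const char *name, const struct in_addr *peer)
{
    struct addrinfo hints, *res = NULL;
    assert(name != NULL && peer != NULL);
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(name, NULL, &hints, &res) != 0)
        return -1;
    int ok = addr_in_list(peer, res);
    freeaddrinfo(res);
    return ok;
}
```

A result of 0 would be logged alongside both the name and the address, so that the audit record does not depend on the name service being truthful.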


FTP 


The FTP application gateway is a single 
process that mediates FTP connections between two 
networks. Since it performs no disk access other 
than reading its configuration file and is a small and 
relatively uncomplicated program, it can be argued 
that it is not capable of compromising the security of 
the system. Just to be certain, the application 
gateway runs as a non-privileged user, after 
“chrooting” itself to a private directory on the 
system. To control FTP access, the application 
gateway reads a configuration file, containing a list 
of FTP commands that should be logged, and a 
description of what systems are allowed to engage in 
FTP traffic. All traffic can be logged and 
summarized. Optionally, the gateway can permit 
FTP traffic from the Internet to the campus network 
for users who first authenticate themselves to the 
system. 
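
The description of which systems may engage in FTP traffic relies on permit-hosts and deny-hosts rules with simple wildcards. A minimal C sketch of such a check, using fnmatch(3), is given below. It assumes that the first matching rule wins and that an address matching no rule is denied; both are assumptions made for illustration, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

enum verdict { DENY = 0, PERMIT = 1 };

/* Hypothetical in-memory form of one permit-hosts or deny-hosts rule. */
struct host_rule {
    enum verdict verdict;
    const char *pattern;   /* address with simple wildcards, e.g. "192.33.112.*" */
};

/* Rules are read top to bottom; the first match decides. No match: deny. */
enum verdict check_host(const struct host_rule *rules, size_t n,
                        const char *addr)
{
    assert(rules != NULL && addr != NULL);
    for (size_t i = 0; i < n; i++)
        if (fnmatch(rules[i].pattern, addr, 0) == 0)
            return rules[i].verdict;
    return DENY;
}
```

Matching on the textual IP address rather than on a hostname follows the earlier advice that addresses are preferred because DNS names can be spoofed.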


TELNET 


The TELNET application gateway is a 
small, simple application that mediates TELNET 
traffic. As with the FTP application gateway, the 
only file accessed is the configuration file that is read 
at start-up. Immediately after the configuration file is 
read, the TELNET application gateway is “chrooted” 
to a restricted directory, where it runs as a non- 
privileged process. The TELNET gateway’s 
configuration file allows specification of which 
systems or networks can use it, and what systems or 
networks it will permit connection to. Initially, it 
will be configured to permit campus systems to use 
the gateway to connect to Internet systems, but not 
vice-versa. Optionally, the TELNET gateway can 
require authentication before permitting use. All 
connections and their durations are logged. 


UDP-Based Services 


Since we decided that no direct traffic 
would be permitted between an outside system and 
an inside system, and since UDP is connectionless 
and point-to-point (and so cannot be used through 
network proxies), UDP services are not allowed. 
Many UDP-based services such as NTP and DNS 
can be provided transparently through a firewall by 
configuring the servers to act as forwarders for 
queries originating within the protected network. 


TCP Access and Use 


On BSD-based UNIX systems, most 
network processes are started up by an initial 
connection to a general-purpose network listener, 
inetd, which establishes a connection between the 
incoming request and the program to service the 
request. For example, an incoming request for the 
TELNET service is “heard” by the running network 
listener. The program, according to inetd’s 
configuration file and the entry for TELNET, is 
executed and connected to the incoming request. 


Inetd, the Internet services daemon, 
performs no function other than to invoke specified 
processes to manage network services when a system 
attempts to connect to them. Some vendor 
implementations permit a systems administrator to 
specify the user-id that the service should be invoked 
as, but there is no provision for limiting access based 
on the source of the request. A variety of 
implementations of “wrapper” processes are 
available on the Internet with varying 
functionality(5). 


The toolkit uses a “wrapper” process 
called netacl, which provides support for all TCP- 
based services. (If only TCP-based services are 
supported, UDP services are disabled and are no 
longer a threat worth worrying about.) Netacl has 
no great advantages over other versions of TCP 
wrappers, other than its minimal size (240 lines of 
code, including a large copyright header and 
comments), its lack of support for UDP (purposely), 
and its sharing a common configuration mechanism 
with the other tools in the toolkit. 
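
The first step of any wrapper of this kind is to learn who is on the other end of the connection that inetd has handed over. The C sketch below shows that step with getpeername(); the peer_address() name is hypothetical and the code is not taken from netacl. A real wrapper would go on to compare the address with its access rules and then exec the real server.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Write the dotted-quad address of the peer connected to `fd` into `buf`.
 * Returns 0 on success, -1 if `fd` is not a connected IPv4 socket.
 * Under inetd, the connection is available on descriptor 0.
 */
int peer_address(int fd, char *buf, size_t buflen)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof sin;

    assert(buf != NULL);
    if (getpeername(fd, (struct sockaddr *)&sin, &len) != 0)
        return -1;
    if (sin.sin_family != AF_INET)
        return -1;
    if (inet_ntop(AF_INET, &sin.sin_addr, buf, (socklen_t)buflen) == NULL)
        return -1;
    return 0;
}
```

Because getpeername() fails on a descriptor that is not a connected socket, the wrapper refuses service by default whenever it cannot identify the caller.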


TCP Plug-Board Connection Server 


Certain services such as Usenet news are 
often provided through a firewall. In such a 
situation, the administrator has the choice of either 
running the service on the firewall machine itself or 
installing a proxy server. Since running news on the 
firewall itself might expose the system to any bugs in 
the news software, it is safer to use a proxy to 
gateway the service onto a “safe” system on the 
campus network. Plug-gw is a general purpose proxy 
that “plugs” two services together transparently. Its 
primary use is for supporting Usenet news, but it can 
be employed as a general-purpose proxy if desired. 
Plug-gw is configurable, as are the other proxy 
servers. Since it only acts as a data pipe, it performs 
no local disk I/O and invokes no subshells or 
processes. Like the other proxy servers, it logs all 
connections. 
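
The "data pipe" behavior can be illustrated by the core copy step of such a proxy. The C sketch below moves one buffer of data from one descriptor to another and handles short writes; pump() is a hypothetical name, and a complete proxy would call it in both directions from a select() loop. It is not the plug-gw source.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy one buffer's worth of data from `from` to `to`.
 * Returns the number of bytes copied, 0 on end-of-file, -1 on error.
 */
ssize_t pump(int from, int to)
{
    char buf[4096];
    ssize_t n;

    assert(from >= 0 && to >= 0);
    do {
        n = read(from, buf, sizeof buf);
    } while (n < 0 && errno == EINTR);      /* retry interrupted reads */
    if (n <= 0)
        return n;

    for (ssize_t off = 0; off < n; ) {      /* write() may be partial */
        ssize_t w = write(to, buf + off, (size_t)(n - off));
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        off += w;
    }
    return n;
}
```

Nothing in this loop touches the file system or starts another process, which is the property the text relies on when it argues that the proxy is safe to run on the bastion host.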


Plug-boarding TCP connections through 
one’s firewall should be undertaken with a degree of 
caution, since plug-gw uses no authentication other 
than the host address of the client, and does no 
examination of the traffic passing across it. In the 
case of NNTP, for example, a security flaw in the 
NNTP server on the internal host could still be 
exploited. The firewall will make it much harder for 
an attacker to gain access to the internal system to 
further exploit the hole; if the flawed NNTP server 
were running on the firewall bastion host itself, the 
entire firewall might be vulnerable. Alternate 
approaches, such as engineering the news server to 
run “chrooted,” are potential areas for future 
research. From the standpoint of systems 
administration, we have found that news 
administration is simplified by running it on a readily 
accessible internal server. 


User Authentication 


The network authentication server authsrv 
provides a generic authentication service for toolkit 
proxies. Its use is optional, required only if the 
firewall FTP and TELNET proxies are configured to 
require authentication. Authsrv acts as a piece of 
“middleware” that integrates multiple forms of 
authentication, permitting an administrator to 
associate a preferred form of authentication with an 
individual user. This permits organizations that 
already provide users with authentication tokens to 
enable the same token for authenticating users to the 
firewall. A secondary goal of authsrv was to provide 
a simple programming interface for authentication 
service, since commercial authentication systems 
tend to have unique, nonstandard interfaces. Several 
forms of challenge/response cards are supported, 
along with software-based one-time password 
systems and plaintext passwords. Use of plaintext 
passwords over the Internet is strongly discouraged, 
due to the threat of password-sniffing attackers. 


A simple administrative shell is included 
that permits the authentication database to be 
manipulated over a network, with optional support 
for encryption of authentication transactions. The 
authsrv database supports a basic form of group 
management; one or more users can be identified as 
the administrator of a group of users, and can add, 
delete, enable, or disable users within that group. 
Authsrv internally maintains information about the 
last time a user authenticated to the server and how 
many failed attempts have been made. It can 
automatically disable or time-lock accounts that have 
multiple failures. Extensive logs are maintained of 
all authsrv transactions. Authsrv is intended to run 
on a secured host, such as the bastion host itself, 
since its database must be protected from attack. 
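
The bookkeeping described above (last successful authentication, count of failed attempts, automatic disabling) can be sketched in C as follows. The record layout, the MAX_FAILURES threshold, and the auth_attempt() function are hypothetical and chosen only for illustration; the actual authsrv database format is not described in this paper.

```c
#include <assert.h>
#include <time.h>

#define MAX_FAILURES 5   /* hypothetical lockout threshold */

struct auth_record {
    int    failures;     /* consecutive failed attempts */
    int    disabled;     /* nonzero once the account is locked */
    time_t last_ok;      /* time of last successful authentication */
};

/* Record one authentication attempt; returns 1 if access is granted. */
int auth_attempt(struct auth_record *r, int credentials_ok, time_t now)
{
    assert(r != NULL);
    if (r->disabled)
        return 0;                 /* locked accounts are refused outright */
    if (credentials_ok) {
        r->failures = 0;
        r->last_ok = now;
        return 1;
    }
    if (++r->failures >= MAX_FAILURES)
        r->disabled = 1;          /* too many failures: lock the account */
    return 0;
}
```

Once an account is disabled, even a correct response is refused until an administrator re-enables it, which limits how long an attacker can keep guessing.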


Testing Firewalls 


Throughout the design of the toolkit, we 
tried to design each component so that it relied 
wherever possible on protections in the UNIX 
environment, rather than on elaborate code designed 
to check and deter threats. While the toolkit software 
doesn’t include a test suite, it is designed so that it is 
easy to verify that each component operates as 
intended. As an example, the SMTP proxy smap 
runs “chrooted” to a subdirectory as an unprivileged 
process. It stands to reason that if the proxy performs 
this operation properly, all files will be created in the 
proper directory, with the proper user permissions. If 
the administrator verifies that this is indeed the case, 
he can rely on the security of the operating system’s 
support for “chroot” and user file permissions. By 
examining the assumptions of each service proxy, a 
degree of assurance that the firewall is well protected 
can be gained. This does not address the problem of 
possible bugs or protocol errors in the proxy 
implementations that might still permit a service to 
pass through the firewall. To attempt to address this, 
every effort is made to keep the implementation of 
the proxies, especially the parts that deal with access 
control, as simple as possible. 


Firewall administration requires a 
seasoned UNIX systems manager. While the toolkit 
is fairly easy to install, it assumes a certain amount of 
expertise on the part of the administrator, since he 
must know how to interpret error conditions, 
configure the system, and disable potentially 
threatening services. While it is a temptation to 
make the toolkit software self-installing and self- 
configuring, doing so raises the possibility that 
someone might install it who lacks the basic skills 
necessary to know if they have in fact secured their 
network. Packaging the toolkit as a set of 
components that can be used freely has proven 
effective, since it fills a need on the part of those 
experienced system managers who would have had 
to design, write, debug, and test their own 
implementations if ours were not available. 


Future Directions 


In the future we will focus on the problem 
of adding newer interactive information retrieval 
services such as Gopher, WAIS and World Wide 
Web and broadcast services such as MBONE. 
Possible avenues for future research include 
integrating cryptography with the firewall software 
to permit firewall-to-firewall service and 
firewall-to-firewall authentication, possibly using 
Kerberos protocols. Support for IP-on-demand services 
like PPP poses a problem for firewalls: is the dial-up user 
to be treated as an untrusted Internet host or as a part 
of the protected network? Adding support for 
authenticated and encrypted PPP service on the 
firewall itself is being examined. 


Observations 


In practice, we find that running servers 
without special system privileges increases our 
assurance that the firewall is secure. More 
importantly, the methodology of turning off all 
services but a minimum, and then auditing each one 
on a case-by-case basis further increases confidence 
that the system is harder to break into. The basic 
design decisions in setting up a firewall (to route or 
not to route, to rely on the host or the router) remain 
unchanged, but the toolkit will work with either 
model. 


Firewalls are a stop-gap measure that is 
needed because many services are developed that 
operate either with poor security or no security at all. 
Perhaps the most important lesson we can learn from 
firewalls is the need for strong session-level 
authentication in applications and well-designed 
application protocols. 


Availability 


The TIS Internet Firewall Toolkit is 
available in source form via anonymous FTP from 
ftp.tis.com:/pub/firewall/toolkit/fwtk.tar.Z. 
Information is available from the authors at 
fwall-support@tis.com. Send mail to 
fwall-users-request@tis.com to be added to the firewall 
toolkit users’ mailing list. Future enhancements to the 
toolkit will be announced on fwall-users and other 
relevant mailing lists. 
Abstract 


SNP provides a high-level abstraction for secure end-to- 
end network communications. It supports both stream and 
datagram semantics with security guarantees (e.g., data 
origin authenticity, data integrity and data confidentiality). 
It is designed to resemble the Berkeley sockets interface so 
that security can be easily retrofitted into existing socket 
programs with only minor modifications. SNP is built on 
top of GSS-API, thus making it relatively portable across 
different authentication mechanisms conforming to GSS- 
API. SNP hides the details of GSS-API (e.g., credentials 
and contexts management), the communication sublayer 
as well as the cryptographic sublayer from the applica- 
tion programmers. It also encapsulates security sensitive 
information, thus preventing accidental or intentional dis- 
closure by an application program. 


1 Introduction 


The explosive growth of network connectivity has signif- 
icantly aggravated the problem of security. Most existing 
network programming paradigms adopt a trust-based ap- 
proach to security (e.g., trusting network packets, trusting 
hosts). This is no longer adequate, especially for mali- 
cious attacks. Indeed, with easy access to networks and 
availability of sophisticated network tools, the effort to 
mount attacks such as spoofing network packets or sniff- 
ing illicit information from network traffic is substantially 
reduced. To effectively counter these attacks, a coherent 
security infrastructure is needed. An important element of 
such an infrastructure is a convenient abstraction for secure 
application network programming. 

In recent years, distributed systems security has received 
a great deal of attention. For example, a number of au- 
thentication systems (e.g., Kerberos from MIT [15], SPX 
from DEC [17] and KryptoKnight from IBM [9]) have 
been designed and implemented. Although these systems 


*Research supported in part by NSA INFOSEC University Research 
Program under contract no. MDA 904-91-C7046 and MDA 904-93- 
C4089, and in part by National Science Foundation grant no. NCR- 
9004464. 
do provide an adequate solution for typical network 
security concerns, they suffer a major common drawback, 
namely, it is difficult to integrate them into an applica- 
tion. More specifically, they do not export a clean and 
easy-to-use interface that can be readily used in imple- 
menting a distributed service. For example, it often takes 
a considerable amount of effort to “kerberize” an exist- 
ing distributed service. Besides, the interface provided is 
not portable, making the switch from one authentication 
system to another a non-trivial task. 

The recently published Internet draft standard Generic 
Security Service Application Program Interface (GSS- 
API) [8] alleviates the problem somewhat. In fact, both 
SPX and KryptoKnight¹ have already implemented a 
small subset of GSS-API.² However, the GSS-API interface 
is still too low-level to be practical for typical network 
application programming. Its proper use requires intimate 
understanding of the underlying GSS-API concepts, which 
can cause significant distraction from the main task of a 
program. It is valid to say that GSS-API is more suited 
for use in system software than in regular application pro- 
gramming. Indeed, it is intended that a typical caller of 
GSS-API be a communication protocol, e.g., telnet, ftp [8, 
p. 2]. 

We believe that what is needed is an abstraction for 
secure network programming that can hide most of the 
details of GSS-API while retaining the same ease of 
use as most existing abstractions for network 
programming. As an analogy, the raw interface to a protocol 
(e.g., tcp_input()/tcp_output() for TCP) is often 
difficult to use, whereas programming using a higher-level 
abstraction (e.g., sockets, TLI) is significantly easier.³ 

In this paper, we discuss the design and implementa- 
tion of SNP (Secure Network Programming), a high-level 


1 KryptoKnight is not public-domain. The use of GSS-API is 
mentioned in [9]. But it is not clear to what extent the 
interface has been implemented. 

2 A recent article in comp.protocols.kerberos states that implementa- 
tions of GSS-API for Kerberos will also be available. 

3 In fact, Berkeley sockets have often been touted as a major 
contributing factor to the popularity of TCP/IP. Although 
Berkeley sockets can support a variety of protocols, they were 
designed mainly with TCP/IP in mind. 
abstraction for secure network programming that we have 
developed. SNP is like sockets or TLI in that it is an 
interface that provides applications access to network com- 
munications. However, it differs from sockets or TLI in 
many significant ways: 


• SNP provides secure network communication. For 
example, it provides data origin authenticity, data 
integrity and data confidentiality services on top of 
the usual stream and datagram services provided by 
sockets or TLI. The precise services provided by SNP 
are detailed in Section 4. 


• SNP provides an end-to-end communication abstraction 
at the application level, whereas sockets and TLI are 
transport level abstractions.⁴ More specifically, 
a socket represents a transport level endpoint (e.g., a 
TCP port), while an SNP endpoint represents an ap- 
plication layer entity (e.g., a server). This distinction 
is important and is further explained in Section 3. 


SNP is implemented on top of GSS-API. It is currently 
in the form of a library. It adopts the same basic design as 
sockets (though several new calls have been added), which 
allows easy transitions from socket-based programs. 

The balance of this paper is organized as follows. In the 
next section, we present an overview of SNP. This provides 
a quick introduction to SNP before delving into details in 
later sections. In Section 3, we elaborate on a list of design 
requirements and decisions we have made in the design of 
SNP. In Section 4, we provide a high-level description 
of the services offered by SNP. In Section 5, we give a 
specification of the SNP interface. In Section 6, we discuss 
various considerations that arise in implementing SNP. In 
Section 7, we provide some figures on the performance 
of our implementation. In Section 8, we compare SNP to 
some related systems. In Section 9, we discuss the lessons 
learned and directions for future work. 


2 Overview of SNP 
2.1 A Quick First Look 


To give a quick introduction of what SNP is, we begin by 
looking at actual SNP code fragments. Figures 1 and 2 
show respectively the typical client and server SNP code. 

As can be easily seen, the SNP interface closely 
resembles that of sockets. This resemblance is not a coincidence. 
Rather, it was a design decision (see Section 3 for the ra- 
tionale). In fact, most of the calls even retain their familiar 


4 This is not strictly true as sockets also provide access to protocols 
in other layers in the communication hierarchy, cf., raw sockets etc. 
However, sockets and TLI are typically considered to be transport layer 
interfaces. 


semantics from their socket counterparts, though their im- 
plementations are quite different. In the following, we 
will focus only on the calls that are new in SNP. 
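
For comparison with the SNP fragments in Figures 1 and 2, the following C function shows the plain Berkeley sockets calls that a client would use for the same exchange, without any security guarantees. It is an illustrative sketch, not part of SNP; plain_send() is a hypothetical name. The socket(), connect(), write(), and close() calls are the counterparts of snp(), snp_connect(), snp_write(), and snp_close(), while snp_attach() has no counterpart here.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `len` bytes to `peer` over TCP; returns bytes written or -1. */
ssize_t plain_send(const struct sockaddr_in *peer, const void *buf, size_t len)
{
    int fd;
    ssize_t n;

    assert(peer != NULL && buf != NULL);
    if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
        return -1;
    if (connect(fd, (const struct sockaddr *)peer, sizeof *peer) < 0) {
        close(fd);
        return -1;
    }
    n = write(fd, buf, len);      /* unauthenticated, unprotected data */
    if (close(fd) < 0)
        return -1;
    return n;
}
```

Nothing in this version identifies either party: the caller learns only that some process accepted the connection at the given transport address.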


There are two main new calls, namely snp() and 
snp_attach(). snp() replaces the socket() call 
in the socket interface. It is similar in functionality to 
socket() in that it creates a communication endpoint. 
It differs from socket(), however, in that an SNP 
endpoint corresponds to an application layer entity rather 
than a transport layer entity. 


In addition to SOCK_STREAM and SOCK_DGRAM, 
snp() supports two new kinds of communication se- 
mantics: SNP_STREAM and SNP_DGRAM. Both extend the 
semantics of their respective socket counterparts by adding 
security guarantees. Specifically, an SNP_STREAM con- 
nection is authenticated. That is, a connection would be 
made only if it is accepted by the intended peer (specified 
by the initiator); and conversely, the identity of the initia- 
tor can be uniquely determined by the intended peer once 
a connection is made. Additional security services (e.g., 
data integrity, data confidentiality) can be activated on an 
SNP_STREAM connection by setting the appropriate 
options using snp_setopt() (see Section 4). Essentially, 
an SNP_STREAM connection can be understood as a 
connection that supports the semantics of a SOCK_STREAM 
connection even in an environment with intruders.⁵ The 
case for SNP_DGRAM is similar. 


snp_attach() is a completely new call; it does not 
have a socket counterpart. The main function of this call 
is to attach an identity to an SNP endpoint. The attached 
identity is the one that would be authenticated to a peer. 
An identity is not just a name; it is a supported claim of 
a particular name.⁶ In other words, an identity can be 
unambiguously verified to another party. In terms of 
implementation, an identity consists of a name together with 
a set of credentials that corroborate the authenticity of 
the name. Typically, the operation of snp_attach() 
involves collecting the appropriate credentials (locally 
and/or remotely) for supporting the specified name. It 
should be noted that the identity attached to an SNP 
endpoint need not be the identity of the caller per se. For 
example, a delegate may want to attach its identity as a 
delegate rather than its own identity. Operationally, a caller 
is allowed to attach any identity to an endpoint as long as 
it is able to gather the required credentials to support that 
identity. 


5 Assuming proper options are set corresponding to the types of threats 
anticipated. 

6 A better term for this is principal. But we intend to keep it informal 
in this paper and refrain from introducing too many terms. 
#include <snp.h> 


if ((snp_ep = snp(AF_INET, SNP_STREAM, SNP_PROTO_DEFAULT)) < 0) { 
    snp_perror("snp() error :"); exit(1); 
} 

/* Initialize local and peer addr structs - just as in sockets */ 

/* Initialize local & peer name structs as shown below */ 
local_name.np.np_val = (char *) malloc(sizeof(client_name)); 
local_name.np.np_len = strlen(client_name); 
strcpy(local_name.np.np_val, client_name); 

peer_name.np.np_val = (char *) malloc(sizeof(server_name)); 
peer_name.np.np_len = strlen(server_name); 
strcpy(peer_name.np.np_val, server_name); 

if (snp_attach(snp_ep, &local_name, &peer_name) < 0) { 
    snp_perror("snp_attach() error :"); exit(1); 
} 

if (snp_connect(snp_ep, sizeof(struct sockaddr_in), 
                (struct sockaddr *) (&peer_addr)) < 0) { 
    snp_perror("snp_connect() error :"); exit(1); 
} 

if ((numbytes = snp_write(snp_ep, buf, buf_size)) < 0) { 
    snp_perror("snp_write() error :"); exit(1); 
} 

if (snp_close(snp_ep) < 0) { 
    snp_perror("snp_close() error :"); exit(1); 
} 





Figure 1: Sample SNP Client Program Fragment 


#include <snp.h> 


if ((snp_ep = snp(AF_INET, SNP_STREAM, SNP_PROTO_DEFAULT)) < 0) { 
    snp_perror("snp() error :"); exit(1); 
} 

/* Initialize local and peer addr structs - just as in sockets */ 

/* Initialize local name structs as shown below */ 
local_name.np.np_val = (char *) malloc(sizeof(server_name)); 
local_name.np.np_len = strlen(server_name); 
strcpy(local_name.np.np_val, server_name); 

if (snp_attach(snp_ep, &local_name, &peer_name) < 0) { 
    snp_perror("snp_attach() error :"); exit(1); 
} 

if (snp_bind(snp_ep, &server_addr, sizeof(server_addr)) < 0) { 
    snp_perror("snp_bind() error :"); exit(1); 
} 

if (snp_listen(snp_ep, 5) < 0) { 
    snp_perror("snp_listen() error :"); exit(1); 
} 

if ((new_snp_ep = snp_accept(snp_ep, (struct sockaddr *) &peer_addr, 
                             &addr_len)) < 0) { 
    snp_perror("snp_accept() error :"); exit(1); 
} 

if (snp_getpeerid(snp_ep, &client_name) < 0) { 
    snp_perror("snp_getpeerid() error :"); exit(1); 
} 

if ((numbytes = snp_read(new_snp_ep, buf, buf_size)) < 0) { 
    snp_perror("snp_read() error"); exit(1); 
} 

if (snp_close(new_snp_ep) < 0) { 
    snp_perror("snp_close() error :"); exit(1); 
} 

if (snp_close(snp_ep) < 0) { 
    snp_perror("snp_close() error :"); exit(1); 
} 





Figure 2: Sample SNP Sequential Server Program Fragment 
Figure 3: Organization of SNP 


2.2 Components of SNP 


SNP is designed and implemented in a modular fashion. 
Each major functionality of SNP is encapsulated in a sep- 
arate layer that exports a well-defined interface. Figure 3 
shows the conceptual layering and major components of 
SNP. 


SNP-API defines the upper (external) interface avail- 
able to an application. Internally, SNP makes use of two 
lower interfaces: GSS-API and an (insecure) network 
communication API. GSS-API encapsulates the details of the 
particular authentication protocol used, thus enhancing 
the independence of the SNP layer from the underlying 
authentication mechanism. Similarly, the communication 
API isolates the details of network communication from 
the SNP layer. We have chosen sockets as the communi- 
cation API, mainly due to its wide availability. 


GSS-API in turn makes use of a lower generic 
cryptographic interface. This interface provides access to all 
cryptographic functions and is generic in the sense that it 
can support any (symmetric or asymmetric) cryptosys- 
tem. This provides “cryptosystem independence” and 
facilitates easy substitution when new (implementations 
of) cryptosystems are available. Further discussion of our 
GSS-API implementation and the underlying authentica- 
tion protocol is beyond the scope of this paper; interested 
readers can consult [23, 20] for more details. 


The main function of the SNP layer is context and 
credential management. It initiates the acquisition of cre- 
dentials, monitors the status of contexts and credentials, 
and initiates renegotiation (of contexts) and/or reacqui- 
sition (of credentials), if necessary. It should be noted, 
however, that the actual storage of contexts and creden- 
tials is internal to GSS-API. 


2.3 SNP in Context 


SNP is part of a larger project of ours that concerns the de- 
sign and implementation of an authentication framework 
for distributed systems [21]. The framework addresses a 
range of authentication needs that includes bootstrapping, 
user logins and peer communications. SNP is designed as 
an interface for accessing the peer authentication protocol 
in our framework. 

Because of its modular design, detailed understanding 
of the other components in the framework is not required in 
order to use or understand SNP. Indeed, SNP is relatively 
independent of the original framework it was designed for, 
and should be easily portable for use in other authentica- 
tion frameworks (see Section 3). Therefore, we will only 
briefly describe the other components in our framework, 
to the extent they are required for the understanding of 
SNP. 

At present, our framework has three protocols: a secure 
bootstrap protocol that creates a bootstrap certificate upon 
successful bootstrapping; a user-host mutual authentica- 
tion protocol that creates a login certificate when a user 
successfully logs in; and a peer-peer mutual authentication
protocol that is the basis for SNP. The login certificate 
is retrieved when a user attaches its identity to an SNP 
endpoint. This certificate is stored in an SNP (GSS-API) 
credential structure and is used to authenticate the user’s 
identity to its peer. 

The peer-peer authentication protocol in our framework 
assumes the use of a commonly trusted authentication 
server (AS). Apart from its authentication duty, AS is 
also responsible for generating the session key used in 
an authentication exchange. Thus, in order for SNP to 
function correctly, AS needs to be properly set up. For 
example, it must be properly secured and be given a correct 
database. The discussion of the associated administrative 
issues is beyond the scope of this paper.

Lastly, a name service is required for translating application
layer entities to their transport layer addresses.
This name service, however, need not be trusted, as SNP
performs the proper authentication during connection
establishment.


3 Design Requirements and Decisions


In designing SNP, we first set out a number of require- 
ments. Based on these requirements, we made several key 
design decisions. We briefly discuss the rationale for these 
requirements and decisions below: 


• SNP should provide end-to-end communication at the
application layer rather than the transport layer. Although
the transport layer is the first end-to-end layer,
we believe the concept of identity is only meaningful
at or above the session layer. For example, in Unix
and TCP/IP, ports are ephemeral and the association
of ports with processes is dynamic.⁷ We believe it
is more appropriate to base our semantics on application
level entities than to assume a secure mapping
between ports and processes.


• SNP should be independent of any particular authentication
protocol or framework. This allows SNP to
be portable across different authentication systems.
We achieve such independence by using GSS-API
to encapsulate the details of the underlying protocol,
and sockets as the communication interface.


• It should be easy to convert existing network application
programs to use SNP. To achieve this, we
designed SNP to retain as much as possible the general
structure of a socket program. Hence: (1) only a
minimal number of new concepts need to be learned
in order to acquaint oneself with SNP; (2) only minor
(mostly syntactic) modifications need to be made to
convert a socket program to an SNP program, thus
significantly facilitating retrofitting.
We could have emulated the TLI interface instead.
But we believe that sockets and TLI are sufficiently
similar to each other that little extra effort is required
to convert TLI programs into SNP programs. Besides,
there are far more existing socket programs
than TLI programs, though TLI is quickly gaining
popularity.


• SNP should work in a heterogeneous environment.
This entails careful consideration of message encoding
and processing. We have chosen XDR [16]
for this purpose, mainly for its simplicity. ASN.1
[1] is used in other authentication systems (e.g., Kerberos,
SPX); we find it to be overly complicated and
not suitable as a prototyping tool. From our experience,
XDR has been adequate, though not as flexible
as we would like.


⁷ Reserved ports are a matter of convention only; there is no permanent
binding.


Figure 4: Services Provided [diagram not reproduced; surviving region label: PD, CA, DC, DI, DOA, DDA]


• SNP should be independent of particular cryptosystems.
We achieve this by encapsulating all cryptographic
functions using a generic cryptographic
interface. In our current implementation, we use the
de facto standard cryptosystem trio, i.e., DES [10]
for symmetric encryption, RSA [13] for asymmetric
encryption and MD5 [12] for message digest.


4 Services Provided 


Security is only well-defined with respect to a threat model. 
In this paper, we assume the standard threat model. That is,
a saboteur can read, insert, delete and modify any network
traffic. It should be noted that a saboteur is not necessarily
a totally external intruder; s/he can also be a legitimate user.
Thus, s/he can use information available to a legitimate 
user in mounting an attack. 

We stress that our model does not include denial of ser- 
vice and traffic analysis threats. It is always possible for 
a saboteur to corrupt all packets passing through. Even 
an infinitely persistent sender cannot overcome such cor- 
ruption if the saboteur is equally persistent. Indeed, most 
network programming abstractions guarantee only safety 
but not progress. 

In the following, we present in high-level terms the ser- 
vices provided by SNP. We first define below the typical 
types of services offered by a secure communication con- 
nection: 


• Persistent Delivery (PD): A sender will persistently
try to retransmit data if it has not been received yet.
Thus, PD implicitly assumes the use of acknowledgments.


• Best Effort Delivery (BED): Data sent may or may
not arrive at the receiver. Each of the intermediate
nodes can either forward or drop the data.


• Sequenced Delivery (SD): If data arrives at a receiver,
it must appear in the same order it was sent.
That is, no reordering or duplication is allowed.


• Data Confidentiality (DC): Data is only legible to
the intended receiver.


• Data Integrity (DI): Data, if accepted by a receiver,
must bear the same content as that sent.


• Data Origin Authenticity (DOA): Data, if accepted
by a receiver, must have come from a known specific
sender.


• Data Destination Authenticity (DDA): When data
arrives, a receiver can unambiguously determine that
it is the intended receiver.


• Connection Authenticity (CA): A connection, if
made, must be between the intended peers.


SNP can provide different combinations of the above
services. The precise combinations provided are summarized
in Figure 4. Each of the two boxes is labeled by a constant
denoting the communication semantics, while each
of the circles is labeled by an SNP option constant. The
combination of services provided under a particular
communication semantics and set of options is labeled in the
intersection of the corresponding regions. For example,
under SNP_STREAM, if both SNP_OPTIONS_SIGNED
and SNP_OPTIONS_SEQUENCED are set, the services
provided are PD, CA, DI, DOA, DDA
and SD. We note that SNP_OPTIONS_SIGNED and
SNP_OPTIONS_ENCRYPTED cannot both be set at the
same time. The case for SNP_DGRAM is similar, and is
omitted.


5 The SNP Interface 


As with the socket interface, SNP-API functions can be 
divided into five classes: initialization, connection estab- 
lishment, data transfer, connection release, and utility. We 
describe the functions in each class below. A complete list 
of all functions is given in Figure 5. Parameter names ap- 
pearing in the following subsections refer to those shown 
there. 

Most functions have semantics similar to their socket
counterparts. (In fact, they are given the same names modulo
the prefix “snp_”.) We have not emulated all the data
transfer functions of sockets (e.g., sendmsg, recvmsg)
due to their intricate semantics. Nonblocking I/O is supported,
but asynchronous (i.e., interrupt-driven) I/O is not.

We also note that most functions below (notable exceptions
being the data transfer functions) return 0 on
success and -1 on failure. In addition, a global variable
snp_errno will contain the appropriate error number on
failure.


5.1 Initialization


Functions in this class are used for creating and initializing
an SNP endpoint. They include snp(), snp_bind(),
snp_listen() and snp_attach().


5.1.1 snp() 


snp() creates an endpoint of communication. Its pa- 
rameters have the same types as socket() and have 
similar semantics. Currently, the only supported value for 
family is AF_INET, corresponding to the internet ad- 
dress family. The possible values of type are shown in 
the following table: 


SNP_STREAM     Secure Stream
SNP_DGRAM      Secure Datagram
SOCK_STREAM    Normal (Insecure) Stream
SOCK_DGRAM     Normal (Insecure) Datagram

For protocol, the currently supported values are as
follows:


SNP_PROTO_DEFAULT       Default Authentication Protocol
SNP_PROTO_PUSH_MODEL    Push Model Authentication Protocol
SNP_PROTO_REVERSE       Reverse Authentication Protocol
IPPROTO_TCP             Normal TCP
IPPROTO_UDP             Normal UDP

A combination of SNP_STREAM and any one of the
first three protocol values results in a secure equivalent of
TCP. Similarly, SNP_DGRAM in combination with one of
the first three protocol constants provides a secure UDP
protocol. The first three protocol constants can be used
only when the type argument has been set to
either SNP_STREAM or SNP_DGRAM. The use of either
IPPROTO_UDP or IPPROTO_TCP results in the normal
(i.e., insecure) UDP or TCP protocols, respectively. These
are equivalent to the semantics provided by the socket
interface.
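
The pairing rules above can be sketched as a small check. This is only an illustration, not part of the SNP implementation: the numeric values of the SNP constants are hypothetical (the paper names them but does not give values), snp_args_supported() is a made-up helper, and the sketch assumes that the secure types accept only the three SNP protocol constants and that the socket convention of passing 0 for protocol is out of scope.

```c
#include <stdbool.h>
#include <sys/socket.h>   /* SOCK_STREAM, SOCK_DGRAM */
#include <netinet/in.h>   /* IPPROTO_TCP, IPPROTO_UDP */

/* Hypothetical values: chosen only so they do not collide with the
 * system's SOCK_* and IPPROTO_* constants. */
enum { SNP_STREAM = 101, SNP_DGRAM = 102 };
enum {
    SNP_PROTO_DEFAULT    = 201,
    SNP_PROTO_PUSH_MODEL = 202,
    SNP_PROTO_REVERSE    = 203
};

/* True for the three authentication protocol constants. */
static bool is_snp_protocol(int protocol)
{
    return protocol == SNP_PROTO_DEFAULT ||
           protocol == SNP_PROTO_PUSH_MODEL ||
           protocol == SNP_PROTO_REVERSE;
}

/* True if (type, protocol) is one of the combinations described
 * in the text; false for anything else. */
bool snp_args_supported(int type, int protocol)
{
    if (type == SNP_STREAM || type == SNP_DGRAM)
        return is_snp_protocol(protocol);   /* secure TCP / secure UDP */
    if (type == SOCK_STREAM)
        return protocol == IPPROTO_TCP;     /* normal TCP */
    if (type == SOCK_DGRAM)
        return protocol == IPPROTO_UDP;     /* normal UDP */
    return false;
}
```

Under these assumptions, a call such as snp(AF_INET, SOCK_STREAM, SNP_PROTO_REVERSE) would be rejected, since an authentication protocol only makes sense on a secure endpoint type.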

snp() returns an SNP handle, of type int. The handle
is an index into an internal table of SNP structures maintained
by SNP. Thus, unlike the descriptor returned by socket(), an
SNP handle is not a file descriptor. Hence, some of the standard
functions that apply to a socket descriptor will not apply to an
SNP handle.

The snp_ep parameter in each of the other functions
in Figure 5 refers to an SNP handle obtained from a call to
snp().


Initialization Calls
int snp           ( int family, int type, int protocol );
int snp_bind      ( int snp_ep, struct sockaddr *local_addr, int addr_len );
int snp_listen    ( int snp_ep, int backlog );
int snp_attach    ( int snp_ep, struct name_s *local_name, struct name_s *peer_name );

Connection Establishment Calls
int snp_connect   ( int snp_ep, struct sockaddr *peer_addr, int peer_addr_len );
int snp_accept    ( int snp_ep, struct sockaddr *peer_addr, int peer_addr_len );

Data Transfer Calls
int snp_write     ( int snp_ep, char *buf, int nbytes );
int snp_read      ( int snp_ep, char *buf, int nbytes );
int snp_send      ( int snp_ep, char *buf, int nbytes, int flags );
int snp_recv      ( int snp_ep, char *buf, int nbytes, int flags );
int snp_sendto    ( int snp_ep, char *buf, int nbytes, int flags,
                    struct sockaddr *to, int tolen );
int snp_recvfrom  ( int snp_ep, char *buf, int nbytes, int flags,
                    struct sockaddr *from, int *fromlen );

Connection Release Calls
int snp_close     ( int snp_ep );
int snp_shutdown  ( int snp_ep, int how );

Utility Calls
int snp_setopt    ( int snp_ep, int level, int optname, char *optval, int optlen );
int snp_getpeerid ( int snp_ep, struct name_s *peer_name );


Figure 5: SNP Interface Specification


5.1.2 snp_bind()


After creation, an address may be bound to an SNP
endpoint using snp_bind(). The local_addr and
addr_len are of the same types as in the bind() function.
They specify the address to be bound.


5.1.3 snp_attach()


snp_attach() is used for specifying the identity a caller
wishes to be authenticated as to its peer and the name of
the intended peer. The name structure name_s is of the
following form: (This structure is automatically generated
by rpcgen from an XDR structure.)


struct name_s {
    struct {
        u_int np_len;   /* Length of the name */
        char *np_val;   /* The actual name */
    } np;
};


If invoked by a server, peer_name may be set
to NULL, in which case connections from any client
are accepted. Once a connection is established,
the identity of the client can be discovered by calling
snp_getpeerid() (see below). snp_attach()
must be invoked before connection establishment, if secure
communication is desired.
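
As an illustration of how a caller might fill in a name_s before calling snp_attach(), here is a minimal sketch. The helpers make_name() and free_name() are hypothetical and not part of SNP-API; the sketch uses unsigned int in place of u_int for portability, and it assumes that np_len counts the characters of the name without a terminating NUL, which the paper does not specify.

```c
#include <stdlib.h>
#include <string.h>

/* Same shape as the paper's name_s (u_int written as unsigned int). */
struct name_s {
    struct {
        unsigned int np_len;   /* Length of the name */
        char *np_val;          /* The actual name */
    } np;
};

/* Hypothetical helper: copy a C string into a name_s.
 * Returns 0 on success and -1 on allocation failure. */
int make_name(struct name_s *out, const char *text)
{
    size_t len = strlen(text);
    char *copy = malloc(len + 1);

    if (copy == NULL)
        return -1;
    memcpy(copy, text, len + 1);          /* include the NUL for convenience */
    out->np.np_len = (unsigned int)len;   /* assumption: NUL is not counted */
    out->np.np_val = copy;
    return 0;
}

/* Hypothetical helper: release the storage owned by a name_s. */
void free_name(struct name_s *name)
{
    free(name->np.np_val);
    name->np.np_val = NULL;
    name->np.np_len = 0;
}
```

A server that is willing to talk to any client would build only its own local name this way and pass NULL as peer_name.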


5.1.4 snp_listen() 


The function allows its caller to specify the maximum allowed
backlog of connection requests. It has the same
semantics as listen(), except that it takes an SNP handle.
Typically, a caller of snp_listen() is a server.
This function can only be used on an SNP_STREAM or
SOCK_STREAM connection.


5.2 Connection Establishment 


The second class of functions consists of
snp_connect() and snp_accept(); they are
mostly used for stream connections.


5.2.1 snp_connect()


For an SNP_STREAM endpoint, this function results in
the establishment of a connection with a peer if a corresponding
snp_accept() is performed by the peer.
A successful connection also indicates a successful
authentication exchange using the underlying authentication
protocol.


In the case of SNP_DGRAM, snp_connect() only 
saves the supplied peer address in an internal SNP struc- 
ture. This address would be assumed to be the destination 
address in all subsequent data transfer unless an explicit 
address is given. No authentication is performed at the 
time of the call; instead, it is performed at the time of the 
first data transfer call. 


5.2.2 snp_accept()


snp_accept() can be used only on an SNP_STREAM
or SOCK_STREAM endpoint. It accepts connection requests
and completes them if the authenticated peer
identity matches the one specified by a previous
snp_attach().⁸ Successful completion also implies
that the peer identity has been authenticated, and can be
discovered using snp_getpeerid(). Furthermore, it
implies the establishment of a pair of security contexts
(one at each peer) and the distribution of a session key.
The return value is a new SNP handle which can be
used for further communication with the peer. Further
connection requests can continue to come in on the original
SNP endpoint. If peer_addr and peer_addr_len are
non-NULL, they will be filled in appropriately.


5.3 Data Transfer


All of the following data transfer functions return the num- 
ber of bytes actually sent or received on success and -1 on 
failure. 


5.3.1 snp_sendto()


snp_sendto() sends nbytes of data pointed to by
buf to the peer address specified by the to parameter.
This function may be used on both stream and datagram
endpoints. In case of a datagram endpoint, both to and
tolen must be specified. The data will be sent encrypted
or signed if the appropriate SNP options have been set
(see snp_setopt() below). The possible values and
semantics of flags are the same as those in sendto().


5.3.2 snp_recvfrom()


snp_recvfrom() attempts to receive nbytes of data
and stores them in a buffer pointed to by buf. The address
and address length of the peer are filled into from and
fromlen respectively, if both of them are non-NULL.
flags has the same semantics as in recvfrom().
The incoming data is decrypted or verified, depending
upon the SNP options specified.


5.3.3 snp_read(), snp_write(), snp_send()
and snp_recv()


These calls can only be used on stream endpoints. Their
semantics are essentially similar to their socket counterparts.
snp_send() and snp_recv() provide additional features
(e.g., expedited data) that are not available
with snp_write() and snp_read(). The nature of
data sent or received depends on the current SNP options.


⁸ If the peer name specified is NULL, connections from any client are
accepted.


5.4 Connection Release 
5.4.1 snp_shutdown() and snp_close() 


These functions have semantics similar to their socket
counterparts, except that they perform the release only after
they have verified that the release request did originate
from the correct peer.
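
To show how the calls described so far fit together, here is a sketch of a stream client after conversion from sockets. It is not code from the SNP distribution. The SNP functions are replaced by stand-in stubs that only record their names in call_log, so that the sketch compiles without the SNP library; the constant values, call_log, log_call() and secure_client() are all hypothetical. The signatures follow Figure 5.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

struct name_s { struct { unsigned int np_len; char *np_val; } np; };

/* Hypothetical values; the paper does not list them. */
enum { SNP_STREAM = 101, SNP_PROTO_DEFAULT = 201 };

/* Stand-in stubs: each one appends its name to call_log and
 * reports success. They do no networking and no authentication. */
char call_log[128];

static void log_call(const char *name)
{
    strncat(call_log, name, sizeof call_log - strlen(call_log) - 1);
    strncat(call_log, " ", sizeof call_log - strlen(call_log) - 1);
}

int snp(int family, int type, int protocol)
{ (void)family; (void)type; (void)protocol; log_call("snp"); return 0; }

int snp_attach(int ep, struct name_s *local_name, struct name_s *peer_name)
{ (void)ep; (void)local_name; (void)peer_name; log_call("snp_attach"); return 0; }

int snp_connect(int ep, struct sockaddr *peer_addr, int peer_addr_len)
{ (void)ep; (void)peer_addr; (void)peer_addr_len; log_call("snp_connect"); return 0; }

int snp_write(int ep, char *buf, int nbytes)
{ (void)ep; (void)buf; log_call("snp_write"); return nbytes; }

int snp_close(int ep)
{ (void)ep; log_call("snp_close"); return 0; }

/* Converted client: returns the number of bytes written, or -1. */
int secure_client(struct sockaddr_in *server, struct name_s *me,
                  struct name_s *peer, char *msg)
{
    int ep = snp(AF_INET, SNP_STREAM, SNP_PROTO_DEFAULT); /* was socket() */
    int sent;

    if (ep < 0)
        return -1;
    if (snp_attach(ep, me, peer) < 0 ||                   /* the one added call */
        snp_connect(ep, (struct sockaddr *)server,
                    (int)sizeof *server) < 0) {           /* was connect() */
        snp_close(ep);
        return -1;
    }
    sent = snp_write(ep, msg, (int)strlen(msg));          /* was write() */
    if (snp_close(ep) < 0)                                /* was close() */
        return -1;
    return sent;
}
```

The only structural change relative to a plain socket client is the snp_attach() call placed before connection establishment; every other line is the socket call with the snp_ prefix.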


5.5 Utility Routines 


These functions are used for manipulating or retrieving the 
characteristics of an SNP endpoint. 


5.5.1 snp_setopt()


snp_setopt() is used to set options available for a regular
socket as well as those specific to SNP. A new constant,
SNP, has been introduced for the level parameter. The
options available at the SNP level are:


SNP_OPTIONS_DEFAULT         Reset all option settings to default
SNP_OPTIONS_ENCRYPTED       Encrypt all subsequent data
SNP_OPTIONS_SIGNED          Sign all subsequent data
SNP_OPTIONS_SEQUENCED       Enforce sequencing on data
SNP_OPTIONS_NOTIFY          Notify caller on context expiry; do
                            not reinitiate authentication
SNP_OPTIONS_CONTEXT_TIME    Set context expiration time

Setting SNP_OPTIONS_DEFAULT results in resetting
all options to their default settings; that is, no encryption,
no signing and no sequencing.

Setting SNP_OPTIONS_ENCRYPTED causes subsequent
outgoing data to be encrypted. Setting
SNP_OPTIONS_SIGNED causes subsequent outgoing
data to be signed. The key to be used for encryption
and signing is the session key maintained in the current
security context. Options SNP_OPTIONS_ENCRYPTED
and SNP_OPTIONS_SIGNED cannot be set at the
same time. To enforce sequencing of data, option
SNP_OPTIONS_SEQUENCED should be set. This may
be used in conjunction with either
SNP_OPTIONS_ENCRYPTED or
SNP_OPTIONS_SIGNED.

When the current security context expires, the SNP layer
automatically renegotiates a new context. This can be
disabled by setting SNP_OPTIONS_NOTIFY, in which
case the SNP user will be notified of context expiry when
it performs an SNP call. The duration of a context can be
set using the SNP_OPTIONS_CONTEXT_TIME option.

Note that the first five options are toggle flags, while
the last one requires the context duration to be specified in
optval.
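
The constraints on the flag options can be sketched as follows. This is an illustration under stated assumptions, not the SNP implementation: the bit values are hypothetical, set_option() and reset_options() are made-up helpers, and the sketch assumes that the default state also has SNP_OPTIONS_NOTIFY cleared (the text only says that automatic renegotiation is the normal behavior).

```c
#include <stdbool.h>

/* Hypothetical bit assignments; the paper gives only the names. */
enum {
    SNP_OPTIONS_ENCRYPTED = 1 << 0,
    SNP_OPTIONS_SIGNED    = 1 << 1,
    SNP_OPTIONS_SEQUENCED = 1 << 2,
    SNP_OPTIONS_NOTIFY    = 1 << 3
};

/* Turn one flag on or off. Returns 0 on success and -1 if the
 * change would leave both ENCRYPTED and SIGNED set, in which
 * case *options is left untouched. */
int set_option(unsigned *options, unsigned flag, bool on)
{
    unsigned next = on ? (*options | flag) : (*options & ~flag);

    if ((next & SNP_OPTIONS_ENCRYPTED) && (next & SNP_OPTIONS_SIGNED))
        return -1;
    *options = next;
    return 0;
}

/* Effect of SNP_OPTIONS_DEFAULT: no encryption, no signing,
 * no sequencing (and, by assumption, no NOTIFY). */
void reset_options(unsigned *options)
{
    *options = 0;
}
```

With these helpers, SIGNED plus SEQUENCED is accepted, while a later attempt to add ENCRYPTED is refused until SIGNED is cleared.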


5.5.2 snp_perror() and snp_getpeerid()


snp_perror() performs the same function as the standard
perror() function, except that it accounts for
SNP-API error codes as well. snp_getpeerid() retrieves
the authenticated identity of the peer.⁹


⁹ In fact, the unauthenticated identity of the peer is available as soon
as the underlying authentication protocol has proceeded beyond a certain
point, even if the authentication exchange fails at the end.


Figure 6: Underlying Authentication Protocol

Figure 7: Control and Data Flow


6 Overview of Implementation


To facilitate discussion of SNP’s implementation, it is
helpful to first briefly describe our implementation of GSS-API.
The authentication protocol underlying our GSS-API
implementation is shown in Figure 6 (I denotes the initiator,
R the responder and AS the authentication server).
The protocol was initially published in [22], and later verified
in [20, 23]. The mapping of this protocol to GSS-API
is quite straightforward, and is described in [23]. The key
point to note is that the communications with AS (steps
(CE4)-(CE6)) are completely encapsulated within GSS-API,
and are not observable by the SNP layer.

Typically, an SNP-API call is translated into a number
of GSS-API calls together with calls to the communication
layer. GSS-API is responsible for generating tokens
that are to be shipped using the communication layer. In
simple terms, the main responsibility of the SNP layer is to
request the right tokens to be generated (according to user
request and current state) and to ensure they are properly
transferred to the peer SNP layer. Figure 7 shows the
relationship between SNP-API calls and GSS-API calls. (The
bold arrows in Figure 7 correspond to the protocol steps
in Figure 6.) For example, a call to snp_connect()
results in two calls to gss_init_sec_context() as
well as three calls to the communication layer.

There are several major considerations in implementing 
SNP. We describe them below: 


• Two types of messages, namely, data and control,
are transferred between SNP peers. Data messages
contain user data and correspond to SNP data transfer
calls, while control messages contain information
related to the operation of the SNP layer (e.g., connection
establishment request/response) and correspond
to SNP control calls (e.g., snp_connect()) and
functions (e.g., context renegotiation).


There are two ways these messages can be transferred.
One is to multiplex them onto a single
connection, and the other is to create a dedicated
connection for each type of message. We opted for the
latter because control messages should generally be
given priority over data messages. Thus, if they are
to be transferred on the same connection, the underlying
communication mechanism must support some
form of priority message facility. Most existing
communication mechanisms (sockets in particular) do not
support such priority message processing well.¹⁰ The
two-connections solution avoids the dependence on
such a mechanism.¹¹


¹⁰ TCP does not support out-of-band data. It does support some elementary
form of urgent data with the urgent bit and the urgent pointer.
Berkeley socket supports out-of-band data, though the precise semantic
guarantee is highly implementation-dependent.

¹¹ In some sense, this is arguable because typically, there is no guarantee
on the relative arrival times of messages sent on different connections.
However, in practice, for connections with the same source and destination,
the times of arrival closely follow the times of the respective
sends.


• The two-connections solution also simplifies buffering
concerns. Specifically, by always reading from
the control connection (and responding to it) first, we
no longer need to buffer all the user data preceding
a control message if a control action is needed. The
elimination of extra buffering also improves performance.


• The use of two connections raises the question of
the address to which the second connection should
be bound. Our current implementation always establishes
the second (i.e., control) connection at a fixed
offset from the user supplied (i.e., data) connection
address. If adopted as a convention, this should not
create any collision problem.
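
The fixed-offset convention in the last item can be sketched as below. The paper does not say what the offset is or whether it applies to the port number; this sketch assumes, purely for illustration, an AF_INET address and an offset of one port. SNP_CONTROL_PORT_OFFSET and control_addr_from_data() are hypothetical names.

```c
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical offset; the actual value is not given in the paper. */
enum { SNP_CONTROL_PORT_OFFSET = 1 };

/* Derive the control connection address from the user supplied
 * data address: same host, port shifted by the fixed offset.
 * Returns 0 on success, -1 if the shifted port would overflow. */
int control_addr_from_data(const struct sockaddr_in *data,
                           struct sockaddr_in *control)
{
    unsigned long port =
        (unsigned long)ntohs(data->sin_port) + SNP_CONTROL_PORT_OFFSET;

    if (port > 65535UL)
        return -1;
    *control = *data;                           /* copy family and host */
    control->sin_port = htons((unsigned short)port);
    return 0;
}
```

Because both peers apply the same rule, neither side needs to exchange the control address explicitly; the cost is that the shifted port must be kept free by convention.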


The main data structure in the SNP layer is the
snp_struct structure. Its definition is shown in Figure 9.
The control_sockfd and data_sockfd fields
contain, respectively, the socket descriptors for the control
and data connections. The fields cred_list_ptr and
ctx_list_ptr contain pointers to GSS layer structures
(see Figure 8). The meanings of most other fields are
given in the comments. Each call to snp() creates an
snp_struct structure; the SNP handle returned is an
index into an internal table of pointers to snp_struct
maintained by the SNP layer.
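
The handle-as-index scheme can be sketched as follows. This is not the authors' code: the table size, the helper names and the trimmed-down snp_struct (only a few of the fields from Figure 9) are assumptions made for the example.

```c
#include <stdlib.h>

enum { SNP_TABLE_SIZE = 64 };   /* hypothetical capacity */

/* Trimmed version of snp_struct; see Figure 9 for the full list. */
struct snp_struct {
    int control_sockfd;   /* Control socket desc */
    int data_sockfd;      /* Data socket desc */
    int family;           /* Params specified in call */
    int type;
    int protocol;
};

/* Internal table of pointers; a handle is an index into it. */
static struct snp_struct *snp_table[SNP_TABLE_SIZE];

/* Allocate a structure in the first free slot and return its index
 * as the handle, or -1 if the table is full or memory runs out. */
int snp_handle_alloc(int family, int type, int protocol)
{
    for (int h = 0; h < SNP_TABLE_SIZE; h++) {
        if (snp_table[h] == NULL) {
            struct snp_struct *s = calloc(1, sizeof *s);
            if (s == NULL)
                return -1;
            s->control_sockfd = -1;   /* no sockets opened yet */
            s->data_sockfd = -1;
            s->family = family;
            s->type = type;
            s->protocol = protocol;
            snp_table[h] = s;
            return h;
        }
    }
    return -1;
}

/* Map a handle back to its structure; NULL for an invalid handle. */
struct snp_struct *snp_handle_lookup(int h)
{
    if (h < 0 || h >= SNP_TABLE_SIZE)
        return NULL;
    return snp_table[h];
}

/* Release a handle. Returns 0 on success, -1 for an invalid handle. */
int snp_handle_free(int h)
{
    struct snp_struct *s = snp_handle_lookup(h);

    if (s == NULL)
        return -1;
    free(s);
    snp_table[h] = NULL;
    return 0;
}
```

A table of this kind is also why an SNP handle cannot be passed to functions that expect a file descriptor: the integer only has meaning to the SNP layer.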

We have only touched upon the main ideas in our im- 
plementation. Most of the details concerning context 
expiration, context renegotiation, etc., have been omit- 
ted due to length limitation. This paper is intended only as 
a preliminary overview. We hope to provide a full account 
in a final report. 


7 Performance 


In this section, we present some performance results of
our SNP implementation. The measurements were done
on a network of Sun SPARCstations 10/30 running SunOS
4.1.3. The resolution of the system clock is on the order of
microseconds.¹²

We first calibrate the performance of our cryptographic
packages. Our DES package is a generic public domain
one, while our RSA/MD5 package is from RSAREF
[4]. Both packages are relatively portable, and are not
optimized. The calibration allows us to determine the
overhead introduced by the SNP layer, excluding cryptographic
cost. This provides a better measure of the
performance of our SNP implementation, because as more
highly optimized cryptographic packages and hardware
become available, the cryptographic cost will diminish,
while the SNP overhead remains constant.


¹² The measurement error, however, is much worse because of context
switching, function call overhead, etc.


Figure 8: Data Structures


struct snp_struct {
    int control_sockfd;            /* Control socket desc */
    int data_sockfd;               /* Data socket desc */
    int family;                    /* Params specified in call */
    int type;
    int protocol;
    struct sockaddr *local_addr;   /* Obtained from snp_bind() */
    int local_addr_len;
    struct sockaddr *peer_addr;    /* Obtained from snp_connect */
    int peer_addr_len;             /* or first data xfer calls */
    struct name_s *local_name;     /* Obtained from snp_attach() */
    struct name_s *peer_name;
    /* The declared types of the remaining fields did not survive
       extraction (a lone "int" remains); the first two are
       pointers to GSS layer structures. */
    cred_list_ptr;
    ctx_list_ptr;
    secure_options;
    no_send;
};


Figure 9: SNP Structure


Data Length        |    16B |   512B |    1KB |    2KB |    4KB |    8KB |   16KB |   32KB
DES Encryption     |   0.42 |   2.85 |   5.25 |   9.97 |  19.15 |  37.47 |  75.19 | 152.58
DES Decryption     |   0.41 |   2.94 |   5.40 |  10.19 |  19.54 |  38.20 |  77.24 | 158.78
DES Sign           |   0.36 |   0.54 |   0.73 |   1.14 |   1.89 |   3.46 |   6.61 |  12.72
DES Verify         |   0.33 |   0.54 |   0.73 |   1.11 |   1.89 |   3.42 |   6.57 |  12.75
RSA 512 Encryption | 541.24 | 542.21 | 546.00 | 551.92 | 560.44 | 577.69 | 618.54 | 689.76
RSA 512 Decryption |  53.93 |  56.85 |  59.11 |  63.83 |  73.14 |  91.29 | 127.49 | 198.42
RSA 512 Sign       | 540.25 | 540.10 | 540.45 | 540.64 | 544.82 | 544.01 | 550.25 | 551.17
RSA 512 Verify     |  53.86 |  53.82 |  54.20 |  54.49 |  55.48 |  57.06 |  60.10 |  66.21
MD5                |  0.056 |   0.2? |   0.43 |   0.79 |   1.52 |   3.00 |   6.00 |  12.11


Table 1: Cryptographic Performance (in milliseconds)


Table 2: Connect and Close Calls (in milliseconds)


Referring to Table 1, the following observations can be
made: (1) The performance of both DES (CBC mode) and
MD5 is linear with respect to data size. (2) The performance
of RSA is also linear except for small data sizes.
This is due to the fact that for large data sizes, the RSA
implementation does not perform “true” RSA encryption.
Instead, it first generates a random DES key, then encrypts
the data with the DES key, and finally encrypts the DES
key using RSA.
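
Observation (1) can be checked with a small least-squares fit. The sketch below is ours for illustration only: it uses the DES encryption timings for 1KB through 32KB as read from Table 1, with decimal points restored from a damaged copy of the table (an assumption), and the helper names ls_slope() and des_ms_per_kb() are hypothetical.

```c
#include <stddef.h>

/* Data sizes in KB and DES encryption times in ms, as read from
 * Table 1 (decimal placement is an assumption of this sketch). */
static const double size_kb[]    = { 1, 2, 4, 8, 16, 32 };
static const double des_enc_ms[] = { 5.25, 9.97, 19.15, 37.47, 75.19, 152.58 };

/* Ordinary least-squares slope of y against x over n points. */
double ls_slope(const double *x, const double *y, size_t n)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;

    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    return ((double)n * sxy - sx * sy) / ((double)n * sxx - sx * sx);
}

/* Estimated DES encryption cost per additional KB, in ms. */
double des_ms_per_kb(void)
{
    return ls_slope(size_kb, des_enc_ms,
                    sizeof size_kb / sizeof size_kb[0]);
}
```

If the cost is linear, the fitted slope should be close to the ratio of time to size at every row, which is what the DES rows suggest: each doubling of the data size roughly doubles the time.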

Our measurements of SNP performance are given in
Tables 2 and 3. All measurements are for SNP_STREAM;
similar measurements apply to SNP_DGRAM, and are omitted.
Note also that these measurements are based on the
use of 512-bit RSA keys (i.e., modulus).¹³

Table 2 shows the timing results for connection establishment
(i.e., snp_connect()/snp_accept()) and
release (i.e., snp_close()). The Total Time row gives
the amount of time accounted for by cryptographic and
XDR operations. The Measured Time row gives the
observed times in establishing and closing an SNP connection.
The difference between Measured Time and Total
Time (the SNP Overhead row) gives the overhead introduced
by SNP. The Regular Socket row gives the time it
takes for the corresponding socket calls to complete. Thus,
for connection establishment, SNP introduced around 0.2s
overhead. A major component of this overhead is the extra
round-trip delay for the communication with the authentication
server and the associated message processing at
the authentication server. For connection release, the SNP
overhead is around 16ms.

Table 3 shows the timing results for data transfer calls
(specifically for snp_write()). The first two rows give
the times for an SNP_STREAM connection with the encrypt
and sign options set, respectively. The third row gives
the time for a regular SNP_STREAM with no option set,
whereas the fourth row gives the time using plain sockets.
The SNP Overhead row gives the overhead introduced by
SNP. It can be observed that the SNP overhead is minimal.


¹³ Our implementation is parametric with respect to key length. We
can easily switch over to 1024-bit keys. That, however, will slow things
down significantly. The increase in cost is not linear in key length.

Two conclusions can be drawn from these measurements:
(1) The cost of cryptographic operations dominates
the total cost of SNP. We believe this can be generalized
to any cryptographic security mechanism. (2) It is possible
to provide security at the application layer without
incurring undue overhead, even with an unoptimized
implementation. We expect a streamlined implementation to
perform even better.


8 Related Work 


Most existing work on secure network communication is 
focused on the protocol or architecture aspects [3, 9, 15, 
17]; not much has been done concerning a general secure 
application network programming interface. 

The work most relevant to ours includes several secure 
RPC systems: the secure RPC package in [2], Sun se- 
cure RPC [18] and DCE secure RPC [14]. The goals of 
these systems are similar to ours: to provide applications 
transparent access to secure communication. However, 
the models of communication adopted are different. RPC 
assumes an implicit communication model. That is, its 
users do not directly manage communications, but instead 
they deal with high-level abstractions in terms of proce- 
dures. SNP assumes an explicit communication model; 
SNP users are directly responsible for initiating connec- 
tions, sending and receiving data, and closing connections. 
The same difference exists between sockets/TLI and RPC 
styles of network programming. 

Apart from this, the implementation of these RPC sys- 
tems is totally different from ours. For example, they tend 
to be tightly coupled to the underlying protocol (e.g., a 
modified Needham-Schroeder protocol [11] is used in [2], 
Kerberos is used in DCE). Our use of GSS-API provides 
protocol independence. 

A recent paper by Wobber et al. [19] describes an operating
system interface for supporting authentication. The
interface is based on a formal theory of a “speaks for” relation
[7]. Its concrete implementation contains several interesting
abstract datatypes, e.g., a Prin type that represents
principals, and an Auth type that represents principals a
process can speak for. In relation to our work, their interface
can be used as an alternate lower interface for SNP. In
other words, instead of translating SNP-API calls to GSS-API
calls, they can be translated to calls to the interface
in [19]. Such a translation should be quite straightforward
because of the high level of abstraction supported. A major
disadvantage of their interface, though, is the lack of
compatibility with other security mechanisms, e.g., Kerberos.
Moreover, their interface has only been implemented on
the Taos operating system, and is currently not available
on Unix.


9 Discussion and Future Work 


We believe SNP represents an important first step toward 
secure network programming for the masses. It is clear 
that many important issues need to be resolved before this 
could be a reality. Some of these issues are: the devel- 
opment of a security infrastructure that provides uniform 
management and distribution of credentials (particularly 
for interdomain authentication), and operating system sup- 
port for basic security concepts such as identity (see [19]). 

One of the other impediments is performance. With 
rapidly improving cryptographic software and hardware, 
this should be a diminishing problem. As demonstrated in 
[5], the speed of a modern RISC-based workstation is al- 
ready quite adequate for most cryptographic computation, 
provided the right algorithms and optimizations are used. 

We are also considering several interesting exten- 
sions to the SNP interface. First, delegation can be 
added. This would involve the addition of two new 
calls: snp_delegate() for the delegating process and
snp_assume() for the delegate. Delegation allows a
delegate to act with the same authority as the delegating 
process. Second, the snp_attach() call can be ex- 
tended to accept identity expressions instead of just simple 
identity specifications. An identity expression can specify 
a combination of identities that would be communicated 
to the peer. 

In terms of implementation, we may try to port SNP 
to other authentication systems conforming to GSS-API. 
Also, the essential ideas of SNP can be adapted to provide 
security at other layers (e.g., transport). The lessons we 
learned in designing and implementing SNP provide useful 
references in such an effort. 

Concerning the design of our interface, we have made
compatibility with sockets one of our top design
requirements. With our present design, a typical
socket program can be converted into an SNP program
by simply adding an snp_attach() call,¹⁴ without
significantly modifying any of the existing code. Alternate


14 And also prefixing socket calls with “snp_”.
designs with less compatibility are possible. For example,
the concept of identity can be promoted to “first class
citizen” status, completely replacing the use of socket
addresses. The functions snp_connect() and snp_accept()
would then become


    int snp_connect ( int snp_ep,
                      struct name_s *peer_name );
    int snp_accept ( int snp_ep,
                     struct name_s *peer_name );


Another concern in the interface design is user con- 
trol. How much control should a user be given and how 
should it be done? For example, users (with the help of 
an operating system) may wish to explicitly manage cre- 
dentials themselves, or to use their own encryption keys or 
algorithms. Our present design allows very limited user 
control (mainly through snp_setopt()); this could be
appropriately extended. 

Finally, there is the question of what the best layer for 
providing security support for network communication is. 
It can be argued that there is no single best layer for this 
purpose. The question then becomes: what is the best 
placement of security functionalities into different layers 
so that the resulting architecture is most general and admits 
least duplication? Much more research is needed to obtain 
an answer. 
Abstract


This paper describes the kernel-based imple- 
mentation of POSIX Threads (Pthreads) in the 
DG/UX™ operating system. The implementation
achieves time efficiency by using a general-purpose 
trap mechanism, known as a Kernel Function Call 
(KFC), that carries an order of magnitude less over- 
head than a traditional system call. On a 50 MHz 
Motorola MC88110, the implementation can create 
and exit a thread (with the associated context switch) 
in 8.1 microseconds and yield to another thread in 4.0 
microseconds. The implementation also achieves 
space efficiency by paging and decoupling bulky data 
structures. 


The advantages of a kernel-based implementa- 
tion include design simplicity, less code redundancy, 
optimization of global (interprocess) operations, 
avoidance of inopportune preemption, and global 
semantic flexibility. The disadvantage is a monolithic 
design that lacks user-level flexibility. 


1. Introduction 


Threads provide an efficient and convenient 
concurrent programming paradigm for applications 
running on shared-memory multiprocessors. As 
industry-standard thread interfaces such as POSIX 
Threads (Pthreads) [Pos93] find their way into open 
systems, an increasing number of portable applica- 
tions are being written (or rewritten) to exploit 
threads. 


Support for Draft 6 of Pthreads was shipped in
version 5.4R3.00 of DG/UX™, Data General’s commercial
UNIX® operating system. DG/UX originated
in 1984 as a rewrite of the UNIX kernel in order to 
support Symmetric Multiprocessing (SMP), full pre- 
emption, and better modularity [Kel89]. 
Though this paper focuses on the techniques
that DG/UX uses to implement threads efficiently in 
the kernel, any operating system could apply the same 
techniques to other thread packages or performance- 
critical system calls. Most importantly, none of these 
techniques are specific to DG/UX or POSIX Threads. 


2. Design Overview 


The overriding design goal is to allow standard 
multithreaded applications to map as efficiently as 
possible (both in terms of time and space) onto cur- 
rent and future multiprocessor architectures. In order 
to realize this goal, the implementation employs three 
main features: 


Kernel Function Calls (KFCs) 


KFCs are fast, general-purpose kernel traps 
that allow Pthreads to be implemented simply and 
efficiently in the kernel. 


The implementation uses the same set of KFCs 
to optimize both local (intraprocess) and global 
(interprocess) thread operations. Global operations 
are important today and will become increasingly 
important as applications and machines move towards 
a paradigm of distributed computing. 


Low Memory Overhead 


By keeping the per-thread physical and virtual 
memory consumption to a minimum, the implementa- 
tion can easily support thousands of threads in a 
single process. 


All per-thread data structures are pageable, 
except a 128-byte kernel-level structure. Moreover, 
space-consuming kernel stacks and other transient 
data are tied only to threads that have entered the ker- 
nel for a traditional system call, page fault, or other 
type of full kernel entry. 
Pthread Groups and Hierarchical CPU Affinity


These extensions give applications explicit 
control over the scheduling of threads onto the under- 
lying multiprocessor machine. 


The affinity mechanism is hierarchical in that it 
parallels the CPU/cache/memory hierarchy of the 
underlying machine. Applications may group together 
threads working on the same data set, then affix that 
thread group to a set of CPUs sharing the same cache 
or local memory. This feature is particularly important 
for improving cache locality on large SMP or NUMA 
(Non-Uniform Memory Access) machines. If an 
application does not specify its thread grouping or 
affinity, the operating system performs automatic 
thread grouping and affinity assignments. 


The rest of this paper is devoted to describing 
the motivations and details of the implementation 
with respect to the first two features. The last feature 
is described separately [Alf94]. 


3. Application Requirements 


There are two broad classes of thread opera- 
tions: 


1) Local Operations are those operations that occur 
among threads in the same process. Local opera- 
tions do not involve communication with other 
processes and are often referred to as intrapro- 
cess or intra-address-space operations. 


2) Global Operations are those operations that 
occur between threads in two different processes 
or between a thread and the kernel. Global opera- 
tions are often referred to as interprocess or inter- 
address-space operations. 


A thread implementation should optimize both 
types of operations. Because global operations require 
more work, they tend to be somewhat slower than 
local operations. However, a request should only 
incur a cost that is proportional to the service that 
must be delivered. 


3.1. Local Operations 


Ideally, a threaded application spends all of its 
time in user space performing local operations and 
rarely talks to the kernel or to other processes. Exam- 
ples of using local-only operations include situations 
in which 


• Database back-end threads perform a parallel
sort of a database table stored in process memory.


• Real-time threads are used to prioritize certain
events that are handled entirely in user space.


• Simulation software breaks up a problem into
smaller simulations that run as threads on separate
CPUs.


3.2. Global Operations 


Commercial applications also make extensive 
use of global operations. For data integrity reasons, 
applications are split into multiple processes, and 
shared system services reside in the kernel or in privi- 
leged server processes. The kernel must always 
participate in (safe) global thread operations, as illus- 
trated by the following examples in which 


• Database front-end threads go through the kernel
to talk to database back-end threads which,
in turn, make kernel system calls to transfer
data to or from permanent storage.


• Real-time threads require kernel-based global
scheduling in order to minimize response time
to global asynchronous events.


• Distributed computing services are divided into
replicated multithreaded processes, which communicate
using a Local or Remote Procedure
Call (RPC) mechanism.


Clearly, global interactions play an important 
role in current and future multithreaded applications. 
Implementations should optimize both local and glo- 
bal cases. 


4. Previous Work 


This section introduces library-based imple- 
mentations of threads and explains why DG/UX has 
not adopted the same approach. Next, the discussion 
focuses on previous work within DG/UX that ultimately
led to an efficient kernel-based
implementation.


4.1. Multiplexed Libraries 


In pursuit of optimal performance, numerous 
multiplexed thread libraries have been developed to 
avoid kernel system calls, especially during perfor- 
mance critical operations such as thread creation and 
context switching [Gol90, Pow91, Ste92, Mue93]. 


Multiplexed libraries employ two levels of 
scheduling. User-level threads are multiplexed onto a 
typically smaller number of kernel-level entities (e.g., 
processes or kernel-level threads). In the simplest 
implementations, all user-level threads are multiplexed
onto a single process. In more sophisticated
implementations, the number of kernel-level entities 
varies with the number of CPUs that are assigned to 
the process. Context switches among kernel-level 
entities involve slower system calls. Context switches 
among user-level threads occur entirely in user space, 
thus optimizing performance. In addition, flexible 
multiplexed libraries [And91b, Mar91] allow applica- 
tions to integrate their own thread schedulers. 


Because multiplexed libraries reside in user 
space, highly tuned library primitives can be used 
only for local thread operations. Global thread opera- 
tions, such as interprocess synchronization and RPCs, 
typically follow significantly slower system call paths 
or use dedicated kernel-entry mechanisms. 


In addition, well-integrated multiplexed librar- 
ies require complex algorithms to bridge the wide gap 
between library and kernel databases. In particular, 
the library and kernel must continuously inform each 
other of the number of runnable threads in user space 
and the number of available CPUs in the kernel. Refer 
to the section on “Kernel vs. User Threads” for a more 
detailed discussion of this and other issues. 


Despite their complexity, multiplexed libraries 
prevail due to the perception that system call overhead 
would dominate the cost of a kernel-based thread 
primitive. 


4.2. System Call Overhead 


This section explains in greater detail why mul- 
tiplexed library-based systems and DG/UX have 
avoided using traditional system calls to implement 
threads. 


Have System Calls Gotten Relatively Slower? 


Recent research in the area of RISC processor 
performance [Ous90, And91a] asserts that operating 
system primitives such as system calls, exceptions, 
and process context switches have not experienced 
the same accelerated speedup as application integer 
and floating-point operations. There are two reasons 
why this observation may hold true for system calls in 
many commercial operating systems: 


1) System calls are typically more memory-inten- 
sive than arithmetic operations, thereby 
introducing more processor stalls in the form of 
cache misses and degraded instruction-level par- 
allelism. 
2) System call entry has grown more complex due
to the introduction of additional functionality and 
complexities in the kernel. 


As an example, DG/UX system call overhead 
has increased mainly for the following reasons: 


1) RISC processors have required the kernel to save 
and restore more registers and to do more work 
during system calls than previous CISC proces- 
sors. 


2) The desire to provide more precise execution 
time accounting has resulted in reading hardware 
timers during system call entry and exit. Histori- 
cally, time accounting has been implemented 
using a coarse ten-millisecond clock tick. 


3) An effort to make the kernel more modular and 
portable has introduced more inter-subsystem 
calls during the system call path. 


Reasons (1) and (3) likely apply to many other 
operating systems, while reason (2) may be specific 
only to DG/UX and a few other commercial systems. 


Regardless of the actual implementation, sys- 
tem call overhead is inherently significant due to user 
state saving and restoring, and preparing the caller for 
execution in a fully preemptive SMP kernel. 


Impact on Thread Primitives 


Most local thread primitives execute in micro- 
seconds or tens of microseconds. System call 
overhead, which is on the order of tens of microsec- 
onds, would contribute significantly to the overall cost 
of a kernel-based thread primitive. Roughly speaking, 
a system call imposes the equivalent of (at least) an 
additional thread-to-thread context switch on each 
kernel entry. 


4.3. Down-Sizing System Calls 


Fortunately, system calls are not the only way 
to enter the kernel. Modern RISC and CISC proces- 
sors provide fast trap instructions, which can be used 
to invoke simpler kernel primitives that are an order of 
magnitude faster than traditional system calls. 


For example, on a (relatively slow) 25 MHz 
Motorola MC88100 [Mot91], the cost of trapping into 
the kernel and returning from the trap, including 
proper Processor Status Register (PSR) and Program 
Counter (PC) setup and restoration, is less than one 
microsecond. A null system call on the same machine 
takes approximately 20 microseconds. 
The actual trap into the kernel is inexpensive.
The real issue is the costly steps that must be per- 
formed in order to set up for a traditional system call. 
In fact, most of these steps can be avoided by using a 
simpler primitive. 


4.4. DG/UX Extended Operations (XOPs) 


For a number of years, DG/UX has used fast 
kernel traps to implement certain extended operations 
(XOPs) that are not provided directly by the Motorola 
88000 instruction set, such as atomic-increment and 
conditional-store to user memory locations. Though 
these atomic operations could have been implemented 
completely in user space using a lock based on the 
88000’s test-and-set instruction, there was a desire to 
avoid any unbounded delay associated with inoppor- 
tune preemption (e.g., hardware interrupt, page fault, 
process abort) while holding a user-level spin lock or 
mutex. 


XOPs prevent this inopportune preemption by 
using simple traps across the user-kernel protection 
boundary. A null XOP takes less than one microsec- 
ond and scales directly with processor speed. 
Atomicity among multiple CPUs is provided using a 
spin lock inside the kernel. In the 88000 family, there 
is no need to disable CPU interrupts explicitly 
because they are implicitly disabled by the trap. Thus, 
preemption is avoided while executing an XOP and 
holding a spin lock. 


Once the XOP has acquired the spin lock, the 
XOP is free to perform the atomic increment or condi- 
tional store on the user memory location. However, 
this memory location could be paged out or invalid. 
Because the XOP holds a critical spin lock and has 
not fully entered the kernel, it must be prepared to 
deal with a fault when touching the user memory 
location. To this end, a flag is set in per-CPU data 
indicating that an XOP is in progress. If a fault occurs 
while touching user memory, the fault prehandler 
ignores the fault and backs out of the XOP. The failure 
is reported back to the user-space library routine, 
which is then responsible for touching the memory 
location. Touching the memory location causes a nor- 
mal page fault to bring in the memory page. Then the 
library routine retries the XOP; the XOP almost 
always succeeds on the second try. 
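
The back-out-and-retry protocol described above can be sketched as a small user-space model. This is only an illustration under stated assumptions: the names user_page, xop_atomic_increment, touch_page, and lib_atomic_increment are hypothetical, and the "resident" flag stands in for the real paging state that the fault prehandler would observe.

```c
#include <stdbool.h>

/* Hypothetical model of one user page holding the target location. */
struct user_page {
    bool resident;  /* false models a paged-out page */
    int  value;     /* the user memory location being incremented */
    int  faults;    /* number of simulated normal page faults taken */
};

/* Models the XOP: it backs out (returns false) instead of faulting. */
bool xop_atomic_increment(struct user_page *p)
{
    if (!p->resident)
        return false;   /* prehandler ignored the fault; XOP backed out */
    p->value++;         /* in the real system this runs under a kernel spin lock */
    return true;
}

/* Models an ordinary user-space touch, which takes a normal page fault. */
void touch_page(struct user_page *p)
{
    if (!p->resident) {
        p->faults++;
        p->resident = true;
    }
}

/* Library routine: try the XOP, touch the location on failure, then retry.
 * Returns the number of XOP attempts that were needed. */
int lib_atomic_increment(struct user_page *p)
{
    int attempts = 1;
    while (!xop_atomic_increment(p)) {
        touch_page(p);
        attempts++;
    }
    return attempts;
}
```

In this model a paged-out location costs two XOP attempts and one ordinary page fault, while a resident location completes on the first attempt.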


4.5. Fast Traps in Other Systems 


Fast traps have been used in numerous other
operating systems. In particular, fast traps have been
successfully exploited for the purpose of speeding up
Interprocess Communication (IPC) [Ber90, Lie92,
Wal92, Lie93]. Like XOPs, these traps are specific to
the primitive that they implement.


5. Kernel Function Calls (KFCs) 


The powerful combination of fast kernel trap 
and fault interception used in XOPs is at the heart of 
the kernel-based thread implementation. However, 
XOPs do not provide the level of semantic flexibility 
and ease-of-use that thread primitives require. In par- 
ticular, XOPs must be implemented in assembly 
language, the XOP back-out mechanism is cumber- 
some, and XOPs do not check for preemptions before 
leaving the kernel. For these reasons, XOPs were 
abandoned in favor of a more general and usable fla- 
vor of fast kernel trap, known as a Kernel Function 
Call (KFC). 


5.1. KFC Semantics 


As the name implies, a KFC is a function that 
resides in the kernel and is callable from user space. 
Calls to KFCs follow the same parameter passing and 
register saving conventions of the underlying proces- 
sor architecture. Returns from KFCs follow the same 
conventions as UNIX system calls on the underlying 
processor architecture. KFCs carry more overhead 
than an XOP, but still an order of magnitude less over- 
head than a traditional system call. Actual KFC 
overhead is on the order of 20-30 machine instruc- 
tions and scales directly with processor speed. 


KFCs operate in a restricted environment 
where CPU interrupts are disabled, time is still 
charged to user space, and blocking with a kernel 
stack is not permitted. Although these restrictions 
could be lifted in order to implement all system calls 
using KFCs, only the shortest system calls would ben- 
efit noticeably. Also, complete generality would bloat 
KFC overhead and introduce undesirable complexi- 
ties, such as the need to unwind a KFC so that a user 
debugger could manipulate a thread’s user registers. 
For these reasons, as well as the desire to avoid other 
likely additions to the KFC entry and exit paths, 
DG/UX limits KFC usage to performance-critical 
primitives (e.g., Pthreads) that can operate in this
restricted environment. Other potential uses include
getpid(), gettimeofday(), and RPCs.


Appendix A gives a complete list of thread- 
related KFC interfaces. Table 1 illustrates the seman- 
tic differences among XOPs, KFCs, and system calls. 
Note that numerous costly steps are avoided during 
KFCs and XOPs. 
Aspect of Kernel Entry/Exit


• Enters the kernel using a fast trap instruction and leaves the
kernel using a fast return-from-trap instruction

• Switches to system time accounting on kernel entry, then back to
user time accounting on exit

• Prepares the environment for full kernel entry and enables
interrupts

• Executes with CPU interrupts disabled

• Can be (and is typically) written in C

• Can block the calling thread and switch to another thread

• Frees up the kernel stack for reuse when the calling thread is
blocked

• Can unblock other threads and check for preemption before leaving
the kernel

• Handles faults directly without explicit detection

• Backs out of faults that occur while touching pageable user or
kernel memory

• Backs out of faults and promotes to an internal system call to
handle the fault

• Follows system call return conventions


Table 1: Semantic Differences
5.2. KFC Details


The following sections describe the general 
KFC mechanism in more detail, with emphasis on 
using KFCs to implement thread primitives. The 
reader may wish to browse or skip these details, then 
return to them after reading the rest of this paper. 


5.2.1. KFC Entry 


The steps in KFC entry proceed as follows for 
all KFCs: 


1) A thread makes a call to a library routine. 


2) The library routine typically performs a few 
steps, then decides to invoke a KFC. The routine 
packages some arguments and traps into the ker- 
nel through a common vector. 


How the arguments are packaged and how the 
trap is performed are architecture-dependent. For 
example, on RISC machines, the arguments are 
typically passed into the kernel via registers. 


On CISC machines, the processor may automati- 
cally copy the arguments into the kernel, or it 
may provide enough registers to hold all of the 
arguments. 


The amount of work that the trap instruction does 
depends on the underlying processor architecture. 
RISC traps are typically faster than CISC traps, 
but CISC traps may perform more steps, such as 
automatically switching to a kernel stack. On 
modern RISC and CISC processors with the same 
clock speed, KFC entry overhead tends to be sim- 
ilar. 


3) Once in the kernel, common KFC entry code 
saves the thread’s user-level return address and 
PSR in per-CPU data. 


Assuming that all KFCs are written in C, the 
entry code saves the user-level stack pointer and 
switches to a kernel-level stack. On CISC archi- 
tectures, these operations may be performed 
automatically by the trap instruction. 


4) Next, the entry code transfers control to the 
actual KFC. On RISC architectures, the argu- 
ments to the KFC are typically already in the 
appropriate registers. On CISC architectures, the 
arguments may need to be pushed onto the kernel 
stack before the call. However, argument pushing 
comes cheaply on modern CISC processors. 


5) The actual KFC looks like a traditional system 
call written in C. 
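
The entry steps above can be sketched in portable C as a simulation of the common vector. Everything named here is an assumption made for illustration (the KFC numbers, the two toy KFCs, the per_cpu layout, and passing the user PC and PSR as ordinary arguments); the real entry code is architecture-dependent and performs the stack switch that this sketch omits.

```c
/* Hypothetical KFC numbers selected by the library routine. */
enum { KFC_NULL, KFC_SUM, KFC_COUNT };

typedef long (*kfc_fn)(long a0, long a1);

/* Partial user state that the common entry code saves per CPU. */
struct per_cpu {
    unsigned long saved_pc;   /* user-level return address */
    unsigned long saved_psr;  /* user-level processor status */
};

struct per_cpu this_cpu;

/* Two toy KFCs; each looks like an ordinary C function. */
long kfc_null(long a0, long a1) { (void)a0; (void)a1; return 0; }
long kfc_sum(long a0, long a1)  { return a0 + a1; }

/* Table reached through the single common trap vector. */
kfc_fn kfc_table[KFC_COUNT] = { kfc_null, kfc_sum };

/* Models common KFC entry: save the return address and PSR,
 * then transfer control to the selected KFC with its arguments. */
long kfc_trap(unsigned nr, unsigned long user_pc, unsigned long user_psr,
              long a0, long a1)
{
    if (nr >= KFC_COUNT)
        return -1;                    /* unknown KFC number */
    this_cpu.saved_pc  = user_pc;
    this_cpu.saved_psr = user_psr;
    return kfc_table[nr](a0, a1);     /* arguments are already "in registers" */
}
```

A library routine in this model would call kfc_trap(KFC_SUM, pc, psr, 2, 5) and receive 7, with the saved PC and PSR left in per-CPU data for the exit path.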
5.2.2. KFC Exit


When the thread has completed its KFC, the 
steps in KFC exit proceed as follows for all KFCs: 


1) From within the KFC, the thread temporarily 
stores its primary and secondary return values to 
per-CPU data variables. Note that the use of per- 
CPU data is allowed because CPU interrupts are 
disabled. 


2) The thread returns naturally from the KFC. The 
return status indicates whether the KFC com- 
pleted normally. 


3) The common KFC exit code reads a different per- 
CPU variable to determine if the calling thread 
should check for priority preemption. This occurs 
if the thread had awakened higher priority threads 
during the KFC. In this case, the calling thread 
yields to another thread and will eventually return 
to this exit code. 


4) The KFC exit code interrogates the return status 
from the KFC. For a normal return, the saved pri- 
mary and secondary return values are returned to 
the library routine. The conventions for returning 
these values are the same as those for system 
calls. For a failure return, the errno value is 
extracted from the return status, and the primary 
and secondary return values are ignored. 


5) The KFC exit code restores the partially saved 
user state, and executes the appropriate instruc- 
tion(s) for returning to user space. 


6) The library routine interrogates the results of the 
KFC in the same way that a library routine would 
interrogate the results of a system call. 


7) The library routine may perform other steps 
before returning to its caller. 
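
A minimal sketch of exit steps 1 through 4 follows, written as a user-space model. The status encoding (zero for a normal return, otherwise an errno value), the structure layouts, and the names kfc_example and kfc_exit are assumptions for illustration; the paper does not specify how the return status is encoded.

```c
#include <errno.h>
#include <stdbool.h>

#define KFC_OK 0  /* assumed encoding: 0 = normal, otherwise an errno value */

struct per_cpu {
    long ret_primary;    /* step 1: values parked by the KFC */
    long ret_secondary;
    bool check_preempt;  /* set if the KFC awakened a higher priority thread */
    int  yields;         /* counts simulated yields, for illustration */
};

/* What the library routine sees, as it would after a system call. */
struct user_result {
    long primary;
    long secondary;
    int  err;
};

struct per_cpu cpu;

/* A toy KFC: stores its results in per-CPU data and returns a status. */
int kfc_example(bool fail)
{
    if (fail)
        return EINVAL;
    cpu.ret_primary   = 7;
    cpu.ret_secondary = 9;
    cpu.check_preempt = true;  /* pretend a higher priority thread was awakened */
    return KFC_OK;
}

/* Common exit code: preemption check, then status interrogation. */
struct user_result kfc_exit(int status)
{
    struct user_result r = { 0, 0, 0 };

    if (cpu.check_preempt) {   /* step 3: yield, later resume here */
        cpu.check_preempt = false;
        cpu.yields++;
    }
    if (status == KFC_OK) {    /* step 4: normal return */
        r.primary   = cpu.ret_primary;
        r.secondary = cpu.ret_secondary;
    } else {                   /* failure: saved values are ignored */
        r.primary = -1;
        r.err     = status;
    }
    return r;
}
```

Because the result has the same shape as a system call result, the library routine needs no KFC-specific code to interpret it.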


5.2.3. Thread Reschedule 


Some KFCs do more than enter the kernel, per- 
form a few steps, and return. In numerous cases, the 
calling thread must suspend its execution. Examples 
of thread suspension include the following: 


• The calling thread attempts to join (wait for) a
thread that has not yet terminated.


• The calling thread explicitly yields to another
thread of equal or better priority.


• The calling thread awaits the release of a mutex
or the signaling of a condition variable.


• The calling thread sleeps for a specific amount
of time.


When a thread suspends, it unwinds the KFC 
that it is executing so that the thread’s register state is 
readily accessible. Then common KFC code saves the 
full register state in a standard mcontext_t struc- 
ture that resides inside the kernel. Next, the thread 
calls into the dispatcher, which switches control to 
another thread. 


Eventually, the original thread will be awak- 
ened and a CPU will continue the thread’s execution. 
The thread’s continuation function [Dra91] restores 
the thread’s register state from its mcontext_t 
structure and resumes the KFC where it left off. 


5.2.4. KFC Fault Detection 


Like XOPs, KFCs must back out of faults that 
occur while referencing pageable kernel or user mem- 
ory. There are two main reasons for backing out: 


1) The KFC operates in a restricted kernel environ- 
ment with CPU interrupts disabled. Full fault 
handling is not permitted in this environment. 


2) The KFC typically holds a critical spin lock. If 
the KFC were to allow the fault to be processed, 
the spin lock could remain held for an indefinite 
period of time. This could lead to deadlock or 
extremely long latencies for other threads in the 
same process. Fault detection actually eliminates 
long latencies associated with critical locks that 
are used to implement threads. 


Faulting on Kernel and User Memory 


Because mcontext_t structures can con- 
sume several hundred bytes, depending on the 
processor architecture, the implementation allows 
these structures to be paged. KFCs must be prepared 
to back out of page faults that occur while saving/re- 
storing state to/from the per-thread mcontext_t 
structure located in the kernel. 


In addition, a few thread KFCs need to refer- 
ence user space from inside the kernel. For example, 
when a thread cannot immediately obtain a Pthread 
mutex in user space (i.e., failing the uncontested 
case), the thread invokes a KFC to await the release of 
the mutex. In this contested case, the KFC saves the 
thread’s state and adds the thread to a kernel-level 
synchronization queue associated with the mutex. 
However, due to the potential for certain race condi- 
tions, the KFC must re-check the user-level mutex to 
see if the mutex has been released before actually put- 
ting the thread to sleep. 
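
The enqueue-then-re-check ordering can be illustrated with C11 atomics. This is a sketch under assumptions, not the DG/UX code: the mutex layout, the wait_queue counter, and all function names are hypothetical, and the sketch assumes that a successful re-check simply acquires the mutex on behalf of the caller.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-level mutex word: 0 = free, 1 = held. */
struct user_mutex { atomic_int locked; };

/* Stands in for the kernel-level synchronization queue. */
struct wait_queue { int waiters; };

/* Uncontested case: a single compare-and-swap in user space. */
bool mutex_trylock(struct user_mutex *m)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&m->locked, &expected, 1);
}

/* Contested case (the KFC): enqueue first, then re-check the user-level
 * mutex, so that a release between the failed trylock and the enqueue
 * is not missed. Returns true if the thread would really go to sleep. */
bool kfc_mutex_wait(struct user_mutex *m, struct wait_queue *q)
{
    q->waiters++;
    if (mutex_trylock(m)) {   /* released in the meantime */
        q->waiters--;
        return false;
    }
    return true;
}

/* Library routine: returns the number of KFC invocations (0 or 1). */
int mutex_lock(struct user_mutex *m, struct wait_queue *q, bool *would_sleep)
{
    *would_sleep = false;
    if (mutex_trylock(m))
        return 0;
    *would_sleep = kfc_mutex_wait(m, q);
    return 1;
}
```

Checking before enqueueing would leave a window in which the holder could release the mutex and find no waiter to wake; the ordering above closes that window.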
The KFC, while attempting to touch the mutex,
must be prepared to back out of two types of faults: 


1) The user page holding the mutex is paged out. 


2) The user mutex address passed into the KFC no 
longer refers to valid user memory. Although the 
Pthread library dereferences the mutex address 
and checks the validity of the corresponding 
mutex structure, another thread could errone- 
ously unmap the mutex memory before the 
library invokes the KFC. 


Detecting Faults 


Whenever a KFC needs to touch kernel or user 
memory that could cause a fault, the KFC calls a triv- 
ial assembly language routine to attempt the memory 
operation. Before performing the operation, this com- 
mon routine sets a per-CPU variable to the address of 
back-out code at the bottom of the routine. Under nor- 
mal operation, this per-CPU variable holds a value of 
zero, indicating that faults are not being intercepted. 


As with XOPs, if a fault occurs while touching 
the memory location, the appropriate fault prehandler 
checks the per-CPU variable. Since the variable has a 
nonzero value, the prehandler knows not to proceed 
normally to the full fault handler. Instead, the prehan- 
dler ignores the fault and branches to the back-out 
code at the bottom of the common assembly language 
routine. The back-out code returns to the KFC, indi- 
cating that a failure occurred while touching the 
memory location. If no fault had occurred, the assem- 
bly language routine would have simply returned 
normally after successfully manipulating the memory 
location. 
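
The interception scheme can be modeled in portable C, with setjmp/longjmp standing in for the branch to the back-out code. This is an analogy under stated assumptions: the real routine is written in assembly language and the fault is raised by hardware, whereas here a "resident" flag simulates the fault and all names (fault_backout, fault_prehandler, kfc_try_store) are hypothetical.

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-CPU back-out target; NULL (zero) means faults are not intercepted. */
jmp_buf *fault_backout = NULL;
int full_faults_handled = 0;

/* Simulated fault prehandler, entered when a memory touch faults. */
void fault_prehandler(void)
{
    if (fault_backout != NULL) {
        jmp_buf *target = fault_backout;
        fault_backout = NULL;
        longjmp(*target, 1);     /* ignore the fault, branch to back-out code */
    }
    full_faults_handled++;       /* otherwise proceed to the full handler */
}

struct page { bool resident; int word; };

/* Common routine: attempt the store. Returns true on success and false
 * if a fault was intercepted and the operation backed out. */
bool kfc_try_store(struct page *p, int value)
{
    jmp_buf env;

    if (setjmp(env) != 0)
        return false;            /* back-out code */
    fault_backout = &env;        /* arm interception */
    if (!p->resident)
        fault_prehandler();      /* simulated fault on the touch */
    p->word = value;
    fault_backout = NULL;        /* disarm */
    return true;
}
```

A false return is the point at which a KFC would request promotion rather than retry from user space.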


Unlike XOPs, whenever a KFC access to ker- 
nel or user memory fails, the KFC is promoted 
directly to an internal system call without returning to 
user space, as discussed in the next section. 


5.2.5. KFC Promotion and Demotion 


In most cases, thread operations are completed 
entirely at the KFC level. However, whenever some- 
thing complicated occurs that cannot be handled at the 
KFC level, the KFC must be promoted to an internal 
system call. These rare complications include the following:


1) Faulting on user memory, such as a user-level
mutex.


2) Page faulting on kernel memory, such as the per- 
thread mcontext_t during register state save 
or restore. 


1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA 


3) Failing to allocate memory resources without 
blocking, for example, during thread creation 
when no dead thread is available for “reincarna- 
tion.” 


4) Detecting the presence of a software interrupt 
sent to the calling thread for such events as a sig- 
nal, Pthread cancellation, abort, or stop. 


In each of these cases, the cost of completing 
the operation is high relative to system call overhead. 
Therefore, neither internal system call performance 
nor promotion overhead is critical. 


Implementation 


The basic idea behind KFC promotion is to 
make the environment appear as if the calling thread 
invoked a system call instead of a KFC. The promo- 
tion proceeds as follows: 


1) The detecting KFC stores in per-CPU data the 
address of its associated internal system call and 
parameters for the call. 


Typically, most of the parameters that were 
passed to the KFC are also passed to the internal 
system call. Furthermore, because complications 
can occur during various phases of the KFC, an 
indication of the phase is also passed as an extra 
argument to the internal system call. This indica- 
tor tells the internal system call where to resume 
the operation, for previous phases have already 
been completed successfully. 


2) The KFC returns a special status code, indicating 
that the KFC is being promoted to a system call. 
The common KFC exit code recognizes this spe- 
cial return status and proceeds with the 
promotion instead of returning from the KFC. 
Essentially, the KFC exit code arranges the envi- 
ronment so that it appears as if the calling thread 
has just trapped into the kernel to execute a sys- 
tem call. However, instead of having passed in a 
system call number, the thread has “passed in” 
the kernel address of the internal system call 
function. Also, unlike a failed XOP, the promoted 
KFC does not force the thread to return to user 
space; the promotion takes place completely 
within the kernel. 


3) The calling thread follows the normal system call 
entry path, except that the system call entry code 
understands that this is an internal system call. In 
particular, the handler knows that the internal 
system call address has been “passed in” rather 
than a system call number. 
4) The internal system call interrogates its arguments
ments to determine the next phase that needs to 
be completed. Because the thread is running in a 
full-fledged system call, the thread is free to page 
fault on user memory, handle signals, etc. 


5) Once the thread completes the slow phase, it may 
decide to complete other phases, suspend itself, 
or return from the system call. 


If the thread decides to suspend itself (in the same 
way that it would have in the KFC), it demotes 
the system call to the KFC so that it can free up 
its kernel stack for reuse (refer to the section on 
“Transient Data”). When the thread is continued, 
it resumes at the KFC level as if it had never been 
promoted. Note that the KFC may get promoted 
again if it encounters a complication during a 
later phase. 


If, on the other hand, the KFC had been promoted 
to handle one of the last phases of the KFC, the 
thread simply returns from the internal system 
call after completing these phases. When the 
thread gets back to user space, the library routine 
recognizes no difference because KFCs follow 
the same return conventions as system calls. 
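

The promotion steps above can be sketched in ordinary user-space C. This is only an illustration of the control flow, not DG/UX kernel code: every name below (kfc_dispatch, percpu, NUM_PHASES, and so on) is hypothetical, the per-CPU area is modeled as a single static structure, and a "complication" is simulated by a parameter naming the phase in which it occurs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values; the paper does not give the real status codes. */
enum { KFC_OK = 0, KFC_PROMOTE = -1, NUM_PHASES = 3 };

/* Assumed shape of an internal system call: the original argument
 * plus the phase at which to resume. */
typedef int (*internal_syscall_fn)(int arg, int resume_phase);

/* Stand-in for the per-CPU data written in step 1. */
static struct {
    internal_syscall_fn fn;
    int arg;
    int resume_phase;
} percpu;

/* Slow path (step 4): resume at the phase the KFC could not finish.
 * Returns the number of phases completed here. */
static int slow_internal_syscall(int arg, int resume_phase)
{
    int phases_run = 0;
    (void)arg;                      /* a real call would use its arguments */
    assert(resume_phase >= 0 && resume_phase < NUM_PHASES);
    for (int phase = resume_phase; phase < NUM_PHASES; phase++)
        phases_run++;               /* free to page fault, take signals */
    return phases_run;
}

/* Fast path: runs every phase unless one hits a complication. */
static int fast_kfc(int arg, int complication_in_phase)
{
    for (int phase = 0; phase < NUM_PHASES; phase++) {
        if (phase == complication_in_phase) {
            percpu.fn = slow_internal_syscall;    /* step 1 */
            percpu.arg = arg;
            percpu.resume_phase = phase;
            return KFC_PROMOTE;                   /* step 2 */
        }
    }
    return KFC_OK;
}

/* Common KFC exit code: on the special status, enter the internal
 * system call by address instead of returning (steps 2 and 3). */
int kfc_dispatch(int arg, int complication_in_phase)
{
    int status = fast_kfc(arg, complication_in_phase);
    if (status == KFC_PROMOTE)
        return percpu.fn(percpu.arg, percpu.resume_phase);
    return status;
}
```

In this sketch, kfc_dispatch(7, -1) stays on the fast path and returns 0, while kfc_dispatch(7, 1) is promoted and completes the two remaining phases on the slow path. Demotion is not modeled.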


Note the following about KFC promotion and 
demotion: 


• Only those KFCs that can encounter complications
are promotable.


• Each promotable KFC is associated with an
internal system call that mimics some of the
phases of the KFC.


• Internal system calls tend to be significantly
larger than their corresponding KFCs because
the KFCs handle only the common,
performance-critical cases. In order to reduce code
redundancy, internal system calls share most of
the same support code as their corresponding
KFCs. With minor effort, it should be possible
to condense the KFC and the internal system
call into one routine.


• The system call entry code remains nearly the
same for both normal system calls and internal
system calls. The only differences are the
specification of the actual call routine and, on some
CISC architectures, where the arguments are
located. The system call exit code is exactly the
same for both types of system call.


• Given that promotion is rare, the only good
reason for demotion is kernel stack reuse.


5.3. Performance Measurements 


The following tables demonstrate that thread 
operations can be implemented efficiently using 
KFCs. All measurements were taken on a dedicated 
Motorola 50MHz MC88110 uniprocessor [Mot91]
using Draft 6 of Pthreads as shipped in the DG/UX 
5.4R3.00 operating system. Each reported measure- 
ment reflects the average elapsed time per iteration of 
a loop that repeatedly invokes the associated primi- 
tive(s). 
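

The measurement method described above (average elapsed time per iteration of a loop) can be sketched as follows. This is an assumption-laden modern equivalent, not the harness used for the reported numbers: it relies on POSIX clock_gettime() and sched_yield(), and the names avg_usec_per_call and null_yield are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <time.h>

/* Average elapsed time, in microseconds, of one call to op(),
 * measured over `iterations` back-to-back calls. */
double avg_usec_per_call(void (*op)(void), long iterations)
{
    struct timespec start, end;

    assert(iterations > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        op();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_usec = (end.tv_sec - start.tv_sec) * 1e6
                        + (end.tv_nsec - start.tv_nsec) / 1e3;
    return elapsed_usec / iterations;
}

/* Example primitive: a yield, which involves no context switch when
 * no other thread is runnable. */
void null_yield(void)
{
    sched_yield();
}
```

A call such as avg_usec_per_call(null_yield, 1000000) amortizes the two clock reads over many iterations, which is why the loop count should be large relative to the cost of the primitive.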


Kernel Entry Overhead 


Table 2 illustrates the costs of XOPs, KFCs, 
and system calls. All calls include the overhead of a 
user-level “wrapper” function in a shared library. 


Note that the null system call time is accurate 
for the DG/UX kernel even though the time exceeds 
the number reported by research projects such as 
[And91a]. This discrepancy is explained by the addi- 
tional steps that DG/UX takes in order to support 
SMP, precise time accounting, multiple APIs, and
greater internal code modularity.


Kernel Entry Type                    Time (usec)

Null XOP (basic trap overhead)
Null KFC (written in C)              1.4
Null System Call


Table 2: Kernel Entry Overhead



Local Operation Overhead 


Table 3 gives times for local (intraprocess) 
thread operations. As in library-based systems, local 
threads are the fastest because they share the same 
CPU time accounting (and timeslice) and get sched- 
uled onto process-local queues. During local context 
switches, the implementation can thus avoid time 
bookkeeping and global scheduling decisions. 


In all cases, local thread operations outperform 
the null system call. The breakdown for thread
create/exit was determined using a hardware logic
analyzer.


1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA 



















Local Operation

Null thread yield (no context switch)
Thread yield (with context switch)
Thread create/exit (with context switch)
  Breakdown:
  • create (with KFC entry/exit)
  • exit (with KFC entry)
  • context switch (with KFC exit)
  • invoke new thread’s start routine
Contested mutex lock/unlock (with context switch)
Condition wait/signal (with context switch)
Thread create/exit/join (with both context switches)
Timed condition wait/signal (with context switch
  and one-second time-out overhead)


Table 3: Local Thread Operations


Task Creation Overhead 


Table 4 gives the time required to create and 
wait for a thread or process to exit. Times are given 
for both locally and globally scheduled Pthreads. 
Unlike local threads, global threads have dedicated 
CPU time accounting and get scheduled onto system- 
wide queues. The local time has been copied from 
Table 3 for comparison. Total time includes the con- 
text switches to and from the new thread or process. 


Create/Exit/Wait
(with both context switches)         Time (usec)

Locally scheduled thread
Globally scheduled thread
Process (using fork)                 7738.9


Table 4: Task Creation Overhead

Context Switch Overhead 


Table 5 gives times for various types of context 
switches involving Pthread condition variables. The 
overhead for obtaining and releasing the associated
mutex is included in the condition-variable times. In 
Table 3, a time was given for locally scheduled 
threads using local (intraprocess) condition variables. 
Table 5 repeats the same value and includes those for 
other combinations of locally vs. globally scheduled 
threads and local vs. global condition variables. Table 
5 also includes a time for two processes using a global 
condition variable in shared memory. Note that two 
processes cannot communicate via local condition 
variables because local condition variables are, by 
definition, only available within a single process. 
Lastly, times are included for traditional UNIX (Sys- 
tem V IPC) semaphores in order to illustrate the 
advantage of using condition variables in multitasking 
applications. 





Sleep/Wakeup Pair
(with context switch)                Time (usec)

Local threads, UNIX semaphore


Table 5: Context Switch Overhead





Comparison with Library Implementations 


Table 6 compares the local-operation
performance of DG/UX threads with two library
implementations that were presented at previous
USENIX conferences. The first is a pure library
implementation from Florida State University
[Mue93], which reported best numbers for a 40MHz
SPARC™ IPX. The second is the SunOS™
multiplexed implementation running on a 25MHz SPARC
1+ [Pow91]. Given the disparity in processor types 
and the fact that thread implementations undergo con- 
tinual tuning, Table 6 does not attempt to declare one 
implementation superior to the others. Nonetheless, if 
one scales the measurements in Table 6 by processor 
speed, one sees that kernel-based implementations 
can perform local Pthread operations in roughly the 
same time as library-based implementations. 


12.3     51.0     not given
13.2     55.0     158.0


Table 6: Comparative Measurements (not scaled)





Summary 


KFCs make both local and global thread opera- 
tions efficient. In particular, the performance of local 
and global operations roughly matches the perfor- 
mance of local operations cited by previous library 
implementations. 


6. Minimizing Memory Consumption 


In order to meet the design goal of minimizing 
per-thread memory consumption, the implementation 
pages and decouples data structures wherever possi- 
ble. Since DG/UX already pages kernel data, the 
paging of thread structures did not require special 
work in the virtual memory subsystem. 


6.1. Kernel-Level Thread Structure 


The only per-thread data that cannot be paged 
is the 128-byte main control block structure that 
resides in the kernel address space. An application 
that uses as many as 1,024 threads consumes only 
128KB. Given current trends in memory capacities, 
this amount of overhead is insignificant. Even on a 
16MB workstation, swap space for 1,024 threads (and 
their associated user stacks) is a more limited resource 
than 128KB of physical memory. On a medium-sized 
server (the primary design target), the impact of this 
128KB is even more negligible. 


6.2. User-Level Thread Structure 


Every thread has a small user-level structure 
that resides at the base of the thread’s stack. This 
structure and its enclosing thread stack are entirely 
pageable. The following information is stored in this
structure:


• a cached copy of the thread’s ID


• the per-thread errno variable


• a pointer to the thread’s most recently pushed
Pthread cleanup handler


• the values for the Pthread thread-specific data
variables
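

As an illustration only, the four items listed above could be laid out as a C structure like the one below. The field names, the fixed number of thread-specific data slots (SKETCH_TSD_KEYS), and the linked-list form of the cleanup handlers are all assumptions; the paper does not give the actual DG/UX definition.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TSD_KEYS 8   /* assumed number of thread-specific data keys */

/* One pushed cleanup handler; handlers form a stack via `next`. */
struct cleanup_handler {
    void (*routine)(void *);
    void *arg;
    struct cleanup_handler *next;   /* handler pushed before this one */
};

/* Hypothetical user-level per-thread structure at the stack base. */
struct user_thread {
    int    cached_id;                      /* cached copy of the thread's ID */
    int    thread_errno;                   /* per-thread errno */
    struct cleanup_handler *cleanup_top;   /* most recently pushed handler */
    void  *tsd[SKETCH_TSD_KEYS];           /* thread-specific data values */
};

/* Push a handler; the caller supplies the storage (typically a local
 * variable in the caller's stack frame). */
void push_cleanup(struct user_thread *t, struct cleanup_handler *h,
                  void (*routine)(void *), void *arg)
{
    assert(t != NULL && h != NULL);
    h->routine = routine;
    h->arg = arg;
    h->next = t->cleanup_top;
    t->cleanup_top = h;
}

/* Pop the most recently pushed handler, running it if `execute` is
 * nonzero and a routine was registered. */
void pop_cleanup(struct user_thread *t, int execute)
{
    struct cleanup_handler *h;

    assert(t != NULL && t->cleanup_top != NULL);
    h = t->cleanup_top;
    t->cleanup_top = h->next;
    if (execute && h->routine != NULL)
        h->routine(h->arg);
}
```

Because the structure lives in pageable user memory, none of these operations requires a kernel entry.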


6.3. Kernel-Level mcontext_t 


Every thread has a kernel-level mcontext_t 
structure that is used to store the thread’s register state 
when it is suspended in a KFC. This standard UNIX 
structure can consume several hundred bytes and is 
entirely pageable. As discussed earlier, KFCs must be 
prepared to back out of page faults on this structure 
while saving or restoring register state. 


6.4. Transient Data 


When a thread enters the kernel for a tradi- 
tional system call, hardware interrupt, page fault, or 
other type of exception, the thread must run on a ker- 
nel stack. The kernel stack, along with other variables 
that are needed for one trip inside the kernel, make up 
transient data. The size of transient data is approxi- 
mately 8KB; thus, it is too expensive to provide each 
thread with a dedicated transient data area. In addi- 
tion, transient data must be made temporarily non- 
pageable while a thread is running on a CPU. 
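

The sizes quoted in Sections 6.1 and 6.4 make the cost difference easy to check. The short sketch below simply multiplies them out; the constant and function names are hypothetical, and the 8KB figure is the approximate size stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    CONTROL_BLOCK_BYTES  = 128,        /* non-pageable, per thread (6.1) */
    TRANSIENT_DATA_BYTES = 8 * 1024    /* approximate size (6.4) */
};

/* Non-pageable control-block memory for nthreads threads. */
size_t control_block_total(size_t nthreads)
{
    assert(nthreads <= SIZE_MAX / CONTROL_BLOCK_BYTES);   /* no overflow */
    return nthreads * CONTROL_BLOCK_BYTES;
}

/* Memory needed if every thread owned a dedicated transient data area. */
size_t dedicated_transient_total(size_t nthreads)
{
    assert(nthreads <= SIZE_MAX / TRANSIENT_DATA_BYTES);  /* no overflow */
    return nthreads * TRANSIENT_DATA_BYTES;
}
```

For 1,024 threads, the control blocks total 128KB, whereas dedicated transient data would total 8MB, half the memory of the 16MB workstation mentioned in Section 6.1.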


In order to reduce the amount of pageable and 
non-pageable transient data, the implementation 
decouples transient data from threads that are no
longer using it. As the word “transient” implies, tran- 
sient data are needed for only one trip inside the 
kernel and can migrate from thread to thread. The fol- 
lowing rules summarize transient data allocation 
requirements: 


• When a thread is actually running on a CPU,
the thread must have non-pageable transient
data because the thread must be able to enter
fully into the kernel for any reason (e.g., to
process an exception).


• Once a thread enters fully into the kernel, its
transient data remain assigned to the thread
until the thread leaves the kernel or exits.
However, once the thread is descheduled (e.g., while
sleeping for a long time in a system call or
when the system is loaded), the kernel may
decide to page out the thread’s transient data.


• In contrast, when a thread suspends in a KFC, it
relinquishes its transient data for reuse by other
threads in the process. This is possible because
the thread has not entered fully into the kernel,
has saved its state in its kernel-level
mcontext_t structure, and no longer needs
its kernel stack. This rule applies also to global
(interprocess) operations that are implemented
using KFCs.


• When a suspending thread demotes an internal
system call to a KFC, the suspending thread
relinquishes its transient data. Similarly, when a
thread is preempted directly out of user space as
a result of taking a timeslice or other hardware
interrupt, the suspending thread demotes to the
KFC level and frees up its transient data.


• The kernel regulates the number of transient
data areas in the process based on the current
number of threads that are making system calls,
page faults, etc. The kernel regulates the
number of non-pageable transient data areas based
on system-wide load and process priorities.


• In a typical application, it is common to have
many threads sharing a small number of
transient data areas.


7. Kernel vs. User Threads 


Given that threads can be implemented effi- 
ciently in the kernel, what are the advantages and 
disadvantages of kernel-based threads vs. library- 
based threads? 


7.1. Design Simplicity 


Kernel-based implementations are inherently 
simpler than multiplexed implementations because 
they employ only one level of thread scheduling; they 
need not treat both user-level threads and kernel-level 
entities. This eliminates the communication gap 
between library and kernel databases that is common 
in multiplexed implementations. The kernel sees all 
threads in the system and can simply schedule runna- 
ble threads directly onto available CPUs with 
minimum latency. Furthermore, because critical data 
structures (including sleep queues) reside in the ker- 
nel address space, an errant application cannot corrupt 
them. 


A kernel-based implementation also reduces 
the amount of code redundancy found in multiplexed 
implementations. For example, there is no need to 
have two sets of user synchronization primitives, one
for user-level threads and one for kernel-level entities.
The same holds true for thread creation, signal han- 
dling, thread timeslicing, user debugging support, etc. 


On the other hand, many Pthread semantics 
creep into the kernel. If the thread library does not 
provide support for a particular thread package or 
function, the application writer may need to wait until 
the kernel provides the desired functionality. Natu- 
rally, many thread packages can be emulated 
efficiently on top of existing Pthread or KFC seman- 
tics. Nonetheless, as a kernel-based implementation 
accumulates more semantics, its monolithic design 
increases in complexity. 


7.2. Performance Focus 


Library-based systems focus their performance 
effort in the user-space thread library. Thus, highly- 
tuned thread primitives can be exploited only for local 
thread operations. 


Kernel-based systems focus their performance 
effort in the kernel address space. Highly-tuned KFCs 
can be used for both local and global thread opera- 
tions. For example, one KFC handles thread 
suspensions for both local and global condition vari- 
ables. As a result, when the same type of thread is 
involved, the overhead for a global condition variable 
is only slightly greater than that for a local condition 
variable. 


7.3. Avoiding Inopportune Preemption 


Library-based implementations protect thread 
data structures with critical user-level locks. If a pre- 
emption or page fault occurs while a critical lock is 
held, other thread operations could be delayed for tens 
of milliseconds (or more) until the critical lock is 
finally released. Such latencies hurt performance, 
scale badly with increased numbers of CPUs, and can 
cause real-time applications to miss deadlines. 


While complex schemes exist to work around 
this problem, a kernel-based implementation provides 
a natural solution. During KFCs, interrupts are dis- 
abled, so no timeslice or other preemption can occur. 
Furthermore, if a page fault occurs while touching 
user space (e.g., retrying a mutex before going to 
sleep), the page fault is intercepted, any critical kernel 
spin lock is released, and the KFC is promoted to an 
internal system call to handle the page fault as if it had 
occurred from user space. 


7.4. Semantic Flexibility


Generally speaking, library-based implementa- 
tions provide more local semantic flexibility, while 
kernel-based implementations provide more global 
semantic flexibility. 


Unlike the most advanced multiplexed
implementations [And91b, Mar91], kernel-based
implementations do not allow applications to directly 
manipulate the internal scheduling mechanism or 
integrate their own scheduling policies. Though 
POSIX provides a rich set of basic mechanisms and 
policies on which more complex constructs can be 
built, a flexible library implementation will allow the 
same constructs to be expressed, while maintaining 
excellent integration with the rest of the library. 


On the other hand, a kernel-based implementa- 
tion can efficiently accommodate additional types of 
global semantics. For example, DG/UX allows appli- 
cations to establish affinity relationships between 
groups of threads and sets of CPUs sharing the same 
cache or local memory [Alf94]. The implementation
maintains local-operation performance within Pthread 
groups, while minimizing penalties for intergroup 
operations. Library-based implementations typically 
inflict greater penalty for CPU affinity because they 
require that user threads be “bound” to slower kernel- 
level entities. 


7.5. User Tools 


In a kernel-based implementation, the kernel 
sees all threads in the system. This simplifies and 
enhances the development of user tools that support 
threads. For example, user debuggers need only con- 
verse with the kernel, and the kernel always supplies a 
consistent set of information. In library-based imple- 
mentations, the debugger must talk to both the kernel 
and the user-level library, and the library could be in 
an inconsistent state. 


DG/UX provides a ps command option that 
displays the detailed status of all threads in the
system. This status information includes, for example,
the user address of the mutex or condition variable on 
which a thread is blocked. Most library-based imple- 
mentations show information only for the underlying 
kernel entities, which have little correlation with 
library-level threads. 


7.6. The Best of Both Approaches 


Library-based systems could implement their
kernel-level entities more efficiently using KFCs.
Such a hybrid would improve the performance of
global operations, while retaining flexibility in user
space.


Conversely, kernel-based systems could pro- 
vide additional mechanisms found in flexible 
multiplexed implementations. For example, a kernel- 
based implementation could provide Pthread exten- 
sions that allow a parallel language runtime library to 
better regulate the number of active worker threads 
based on the current number of assigned processors. 


8. Conclusions 


This work demonstrates how challenging basic 
assumptions (e.g., that threads cannot be implemented 
efficiently using system calls) can favorably redirect 
the course of a design. This work also illustrates the 
trade-offs involved in implementing operating system 
primitives in kernel space vs. user space. 


For example, when KFCs are combined with 
the conservative use of memory, the result is a simple 
and efficient kernel-based implementation of POSIX 
Threads. While such an implementation forfeits some 
semantic flexibility in user space, it optimizes both 
local and global operations, and can easily accommo- 
date additional global semantics. 


Finally, any operating system could use the 
same general KFC mechanism to implement its own 
thread package or other types of fast kernel primi- 
tives, including the following global operations: 


• Local or remote procedure call


• Global asynchronous event notification in the
form of a newly created thread


• Interprocess thread creation or migration


• Quick gettimeofday(), getpid(), etc.


Appendix A: Thread KFC Interfaces


This section lists all KFC interfaces for 
Pthreads and Pthread Groups. Like other UNIX
systems [Pow91], DG/UX refers to kernel-level threads
as LWPs (Light Weight Processes). However, DG/UX 
LWP interfaces differ and, as discussed earlier, 
DG/UX dedicates a very efficient LWP to each thread. 


In all cases, a trivial “wrapper” function in the 
Pthread library invokes the appropriate KFC. Upon 
completion, the KFC returns values and status in the 
same way as a traditional system call. Because
DG/UX ships the Pthread library only in shared form 
and does not publish the KFC interfaces, these inter- 
faces can change from revision to revision without 
requiring that users relink their applications. 


The following KFCs create, terminate, join, 
detach, and manipulate the scheduling of threads. In 
the case of __lwp_create(), the new thread 
begins execution at a pre-registered library function 
that invokes the thread’s start routine on the thread’s 
user stack. 


int __lwp_create(
    lwp_sched_info_t    sched_info,
    lwp_stack_info_t    stack_info,
    void              (*start_rtn)(),
    void *              start_arg,
    lwp_group_id_t      group_id);

int __lwp_join(
    lwpid_t             lwpid);

int __lwp_exit(
    void *              exit_status);

int __lwp_detach(
    lwpid_t             lwpid);

int __lwp_set_sched(
    lwpid_t             lwpid,
    lwp_sched_info_t    sched_info);

int __lwp_get_sched(
    lwpid_t             lwpid);

int __lwp_yield(
    int                 to_peers,
    int                 is_global);

int __lwp_sleep(
    unsigned            seconds,
    unsigned            nseconds);


The following KFCs manipulate synchroniza- 
tion queues that are used to implement the contested 
cases for mutexes and condition variables. The 
Pthread library does not invoke KFCs in the
uncontested cases.


int __lwp_sq_alloc(
    lwp_sq_info_t       creation_info);

int __lwp_sq_dealloc(
    lwp_sqid_t          sq_id);

int __lwp_sq_sleep(
    lwp_cond_t *        cond_ptr,
    lwp_mutex_t *       mutex_ptr,
    struct timespec *   abstime_ptr,
    lwp_sqid_t          sq_id);

int __lwp_sq_wakeup(
    lwp_sqid_t          sq_id,
    int                 is_broadcast);
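

The statement that the library enters the kernel only in the contested case can be illustrated with a minimal mutex sketch. It is not the DG/UX library code: it uses C11 atomics, all names (sketch_mutex, sketch_lock, kernel_entries) are hypothetical, and where a real library would sleep in a KFC such as __lwp_sq_sleep() the sketch merely counts the would-be kernel entry and reports contention.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_mutex {
    atomic_int locked;          /* 0 = free, 1 = held */
    int        kernel_entries;  /* contested attempts (simulated KFC calls) */
};

void sketch_init(struct sketch_mutex *m)
{
    assert(m != NULL);
    atomic_init(&m->locked, 0);
    m->kernel_entries = 0;
}

/* Returns 0 if the lock was taken entirely in user space. Returns -1
 * if the lock is held; a real library would block in the kernel here. */
int sketch_lock(struct sketch_mutex *m)
{
    int expected = 0;

    if (atomic_compare_exchange_strong(&m->locked, &expected, 1))
        return 0;               /* uncontested: no kernel entry */
    m->kernel_entries++;        /* contested: would sleep on a queue */
    return -1;
}

/* Release the lock. A real library would also wake any sleeper
 * (for example through __lwp_sq_wakeup()) in the contested case. */
void sketch_unlock(struct sketch_mutex *m)
{
    atomic_store(&m->locked, 0);
}
```

An uncontested lock/unlock pair therefore costs two atomic operations and no trap, which is the property that lets a kernel-based implementation match library implementations on the common path.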


The following KFCs create, destroy, and 
manipulate DG/UX Pthread Groups, which are
discussed separately [Alf94].


int __lwp_group_create(
    lwp_sched_info_t    sched_info);

int __lwp_group_destroy(
    lwp_group_id_t      lwp_group_id);

int __lwp_group_set_sched(
    lwp_group_id_t      lwp_group_id,
    lwp_sched_info_t    sched_info);

int __lwp_group_get_sched(
    lwp_group_id_t      lwp_group_id);

int __lwp_group_get_times(
    lwp_group_id_t      lwp_group_id,
    struct timespec *   user_time_ptr,
    struct timespec *   sys_time_ptr);


UNIX is a registered trademark of Unix Systems 
Laboratories, Inc. 

DG/UX is a trademark of Data General Corporation.
SunOS is a trademark of Sun Microsystems, Inc. 
SPARC is a trademark of SPARC International, Inc. 


Using OS Locking Services to Implement a DBMS:
An Experience Report 


Andrea H. Skarra 
AT&T Bell Laboratories 


Abstract 


The paper describes a black-box analysis of the 
locking facilities in several UNIX-compatible oper- 
ating systems for their ability to support transac- 
tion synchronization. It assesses the facilities for 
their adequacy, flexibility, and performance. Most 
of the operating systems in the study provide ade- 
quate support for a simple two-phase locking trans- 
action system that does not require customized or 
priority-based scheduling of lock requests. The per- 
formance depends on a variety of factors: the 
average execution time for a lock request varies 
directly with the number of concurrent locks in 
the system and inversely with the number of files
locked for a given number of lock requests. The 
request time is smaller when the locked files are 
local to the requesting process instead of remote, 
and when a process locks a file’s segments in order 
of adjacency rather than randomly. For the areas 
in which the OS provides inadequate support, the 
paper proposes several specific remedies. 


1 Introduction 


Applications typically interact with a database 
in the context of transactions to maintain con- 
sistency. A transaction is a sequence of opera- 
tions that satisfies all consistency constraints on 
a database, and the database management system 
(DBMS) synchronizes concurrent transactions to 
produce a serializable execution (i.e., one that pro- 
duces the same effect as some serial execution of 
the transactions) [3]. Thus, a database remains 
consistent across repeated and concurrent access 
when an application uses transactions. 

Most commonly, the DBMS uses locking to syn- 
chronize transactions. The transactions request 
locks in a two-phase protocol that guarantees se- 
rializability, and a lock manager services the re- 
quests. Under the protocol, a transaction neither 
releases nor weakens its locks (e.g., from an exclu- 
sive to a share lock on the same object) until after 
it obtains all its locks. 

The lock manager detects conflict and
deadlock among the transactions, and it implements
a scheduling algorithm for allocating the locks. 
Frequently, the lock manager is a server process: 
transactions send messages to the server to request 
locks, and the server maintains the data structures 
necessary for scheduling and for conflict and dead- 
lock detection. Each such lock server is dedicated 
to a single DBMS. 

We are developing a DBMS that uses the lock- 
ing services of the operating system (OS) instead 
of a dedicated lock server [5]. The OS (i.e., a 
POSIX.1-compatible system [6] together with NFS)
supports fine-grained Read (Share) and Write
(Exclusive) locking at the byte level for both local and
remote files in a local area network (LAN) via the
system call fcntl(). Each transaction requests
locks from the OS in a two-phase protocol, and it 
commits in a way that guarantees atomicity. In- 
terprocess communication between a client trans- 
action and a database server is replaced by direct 
interaction with the OS to acquire both data items 
and Read/Write locks. 

The advantages of using an OS locking facility 
to a DBMS developer include the following: 


• Ease of implementation
An OS locking facility is an alternative to a
dedicated lock server, and it makes the task
of implementing one unnecessary for the
developer.


• Reduced system size and complexity
The OS facility is already present; it does not
add to system size as does a dedicated lock
server.


• Openness
An OS locking facility is a service open to
any application (i.e., the facility is not buried
within a closed, monolithic system). It can be
used for synchronizing across applications.


• Rapid detection of process failure
The kernel knows sooner than user-level
processes when to release locks due to process
failure.


• Potential for performance improvement
For a given lock manager (i.e., data structures
and algorithms), an application request for
a lock is faster when the manager is
implemented in the kernel rather than as a
user-level process due to fewer context switches.


The paper assesses the OS locking services in 
terms of support for transaction synchronization. 
It considers the performance of the OS in granting 
lock requests, the ease of implementing a proto- 
col that guarantees serializability, and the ability 
to customize the system for prioritized scheduling 
schemes. The assessment is a black-box analysis; 
it does not consider the actual OS code. More- 
over, it covers only the OS locking services; the 
applicability of other OS capabilities to DBMSs is 
beyond the scope of the paper. 

The contribution of this work is that it iden- 
tifies requirements that OS locking services must 
satisfy to provide adequate and flexible support 
for transaction synchronization. It also identifies 
the areas in which the services provide inadequate 
support, and it proposes several specific remedies. 

The paper first gives an overview of the OS lock- 
ing services as defined by the documentation. It 
then describes the system’s behavior in terms of 
lock allocation, deadlock detection, and schedul- 
ing, as determined empirically in our study, and 
it assesses the system for its ability to support a 
protocol that guarantees serializability. The paper 
then presents the results of a performance study, 
and it concludes with an overall assessment of the 
services together with some recommendations. 


2 OS Locking Services


Several commercial OSs provide a locking fa- 
cility through which processes can lock (parts of) 
files on the same host. We consider the follow- 
ing OSs: HP-UX 9.01 (UX), IRIX 5.1.1.2 (IRIX), 
RISC/os 4.52B (RISC), Solaris 2.1 (Solaris1), So- 
laris 2.3 (Solaris3), SunOS 4.1.1 (SunOS), and 
UNIX System V Release 4 (UNIX). The OSs sup- 
port the same locking interface and compatible 
protocols, and their file systems can be integrated 
in a LAN under the Network File System (NFS) 
for a distributed configuration. NFS transparently 
propagates lock requests to remote hosts [2, 11]. 

For the remainder of the paper, the term OS 
refers generically to any of the systems. We dis- 
tinguish OSs only when there is a discernible dif- 
ference in their behavior or performance. 


2.1 Locking 


The OS supports Read and Write lock modes 
with the usual conflict semantics: a Write lock 
conflicts with a Read or Write lock held by a dif- 
ferent process on the same file segment; Read locks 
do not conflict with each other. 
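
The conflict rule just stated can be written as a small predicate. This is a sketch for illustration, assuming the two locks belong to different processes and the segments are given as a start offset and a positive length; the function names are hypothetical, and only the F_RDLCK and F_WRLCK constants come from <fcntl.h>.

```c
#include <assert.h>
#include <fcntl.h>

/* Nonzero if byte ranges [a_start, a_start + a_len) and
 * [b_start, b_start + b_len) share at least one byte. */
int segments_overlap(long a_start, long a_len, long b_start, long b_len)
{
    assert(a_len > 0 && b_len > 0);
    return a_start < b_start + b_len && b_start < a_start + a_len;
}

/* Nonzero if a requested lock conflicts with a lock held by a
 * different process: the segments must overlap and at least one of
 * the two modes must be Write. */
int locks_conflict(short req_mode, long req_start, long req_len,
                   short held_mode, long held_start, long held_len)
{
    assert(req_mode == F_RDLCK || req_mode == F_WRLCK);
    assert(held_mode == F_RDLCK || held_mode == F_WRLCK);
    if (!segments_overlap(req_start, req_len, held_start, held_len))
        return 0;
    return req_mode == F_WRLCK || held_mode == F_WRLCK;
}
```

Two Read locks never conflict, and two Write locks conflict only when their segments overlap.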

A process requests a lock on an object with or 
without blocking. If it issues a blocking lock re- 
quest that conflicts with some other lock on the ob- 
ject, the OS suspends the process and queues the 
request until the process that holds the conflicting 
lock releases it. If the request is nonblocking in- 
stead, or if queuing would result in deadlock, the 
request returns immediately with an error code. 

The lockable entities in the OS are files. A pro- 
cess can lock all or part of a file; the smallest lock- 
able unit is a byte. A process can lock nonexis- 
tent bytes that are partially or totally beyond the 
physical end-of-file (EOF). It is also possible to 
lock just the EOF itself to prevent creation of new 
slots in the file. 

Each lock is associated with a file and a process. 
When a process terminates, or when it closes a 
file, the OS releases any locks the process holds 
on the file automatically. A process can explicitly 
release locks as well. Locks are not inherited by 
child processes, and they are not transferable to 
another process. 
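
That a child does not inherit its parent's locks can be observed directly. The sketch below is an assumption-based demonstration using standard POSIX calls, with hypothetical function names. The parent locks byte 0, then a forked child asks through F_GETLK who would block a Write lock on that byte. If the child had inherited the lock, it would see no conflict; instead it is told that the parent holds it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take a nonblocking Write lock on byte 0 of fd. Returns 0 on success. */
int lock_first_byte(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;
    return fcntl(fd, F_SETLK, &fl);
}

/* Fork a child that queries byte 0 with F_GETLK. Returns 1 if the
 * child reports the parent as holder of a conflicting lock, 0 if it
 * sees no conflict, and -1 on error. */
int child_sees_parent_lock(int fd)
{
    pid_t parent = getpid();
    pid_t child;
    int status, code;

    assert(fd >= 0);
    child = fork();
    if (child == -1)
        return -1;
    if (child == 0) {
        struct flock fl = {0};

        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 1;
        if (fcntl(fd, F_GETLK, &fl) == -1)
            _exit(2);
        _exit((fl.l_type != F_UNLCK && fl.l_pid == parent) ? 1 : 0);
    }
    if (waitpid(child, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

The child shares the parent's file descriptor but not its locks, so a DBMS that forks worker processes must have each worker acquire its own locks.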

The kernel manages the locking of local files, 
and it may impose a limit on the number of lock 
requests that it manages at a time. For locking 
remote files, a user-level daemon (lockd) runs on 
each machine in the LAN. The lockd on a specific 
host H handles lock requests from the H kernel for 
files on other hosts, and it handles lock requests 
from lockds on other hosts for files on H. 

To illustrate, suppose a process on host H1
requests a lock on a file f. If f is local to H1, the
kernel alone handles the lock request. If instead f
is on a different host H2, the kernel passes the
request to its local lockd, which passes it to the lockd
on H2, which passes it to the H2 kernel. The
kernel grants the request if no other process holds a
conflicting lock on f; otherwise, it denies or queues
the request. The result returns to the H1 process
via the same path.

If a remote server crashes, the lock daemon tries 
to restore locks that were held by processes. If 
it cannot reinstate a particular lock, it sends a 
SIGLOST signal to the process, and the process 
must abort. The OS does not provide a function 
through which a process can find out which locks 
it currently holds. 




The OS supports both mandatory and advisory 
locks, although POSIX.1 includes only the latter. 
Mandatory locks block file reads and writes until 
conflicting locks are removed, while advisory locks 
are used in protocols that processes voluntarily
follow. The study covers only advisory locks, which
are safer and sufficient for a DBMS. A runaway
process that fails to release a mandatory lock can hang
or crash the system, and unauthorized access can 
be prevented via mechanisms other than locking. 


2.2 Syntax 


The signature of the system call that a process 
uses to request a lock on (a part of) a file is the 
following: 


int fcntl(fd, cmd, arg) 
int fd, cmd, arg; 


where fd is a file descriptor returned by open(), 
arg is a pointer to a structure that specifies the 
location and number of bytes to be locked and the 
desired lock mode (i.e., F_RDLCK (Read), F_WRLCK 
(Write), or F_UNLCK (UnLock)), and cmd is one of
the following: 


F_SETLK Set or clear the file segment lock l de-
scribed in arg. If another process holds a lock
that conflicts with l, fail and return immedi-
ately. The requesting process must retry to
get l.


F_SETLKW Same as F_SETLK, except the process 
blocks until no other process holds a conflict-
ing lock and l is granted. If blocking the pro-
cess would result in deadlock, fail and return
immediately. 


F_GETLK Get a description of the first lock that
conflicts with l. If such a lock exists, return
that lock’s description, including the identi-
fier of the process that holds the lock. Other-
wise, return F_UNLCK, whether the requesting
process already holds l or not.


The OS recognizes special settings for the fields 
in the arg structure that allow a process to lock 
the entire file or to lock the EOF and beyond with 
one fcntl() invocation.
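
The call described above can be sketched as follows. This is a minimal illustration that assumes the POSIX struct flock layout (fields l_type, l_whence, l_start, l_len); the helper name lock_range() is hypothetical and does not appear in the study.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: request an advisory lock of the given type
 * (F_RDLCK, F_WRLCK, or F_UNLCK) on len bytes starting at offset start.
 * len == 0 is the special setting that covers everything from start
 * through EOF and beyond. If wait is nonzero the call blocks (F_SETLKW);
 * otherwise it fails immediately on conflict (F_SETLK).
 * Returns 0 on success, -1 with errno set on failure. */
int lock_range(int fd, short type, off_t start, off_t len, int wait)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;   /* l_start is measured from the start of file */
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, wait ? F_SETLKW : F_SETLK, &fl);
}
```

For example, lock_range(fd, F_WRLCK, 0, 0, 0) tries to Write-lock the entire file without blocking, and lock_range(fd, F_UNLCK, 0, 0, 0) releases every lock the process holds on it.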


3 Deadlock detection 


When a process issues a blocking lock request
that conflicts with some current lock on the object,
and queueing the request would result in deadlock,
the request fails and returns immediately with an
error code. The process then decides on an ac-
tion: it can retry after some delay, it can request
a different lock, or it can abort.


          byte 1     byte 2     byte 3
Held:     P1 Wlock   P2 Rlock   P3 Wlock
Waiting:  P3 Rlock (bytes 1-2)  P2 Rlock (byte 3)

Figure 1: A deadlock detection example


We found that the OSs (with NFS) detect dead- 
lock among processes on the same or different ma- 
chines in a LAN, provided the files they access are 
on a single host, and they correctly handle subtle 
cases such as the following. In Figure 1, processes
P1 and P3 hold Write locks on bytes 1 and 3 re-
spectively; P2 holds a Read lock on byte 2. P3
requests a Read lock on bytes 1 and 2, and P2
requests a Read lock on byte 3. Although P2 is
waiting for P3 at byte 3, and P3 is waiting for P2
at byte 2, deadlock is not actually present. P3 is
really waiting for P1 at byte 1. Accordingly, the
OS suspends both P2 and P3, and does not refuse
either request due to deadlock. When P1 termi-
nates, P3 gets its locks and continues processing,
and when P3 terminates, P2 gets its lock.
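
The reasoning in this example can be sketched as a wait-for graph check. The sketch below is an illustration only, not the algorithm any of the tested OSs actually uses (the study is black-box); the type and function names are hypothetical, and process numbers are assumed to be smaller than MAXP. The key point is that an edge is added only when the lock modes really conflict, so a Read request that overlaps another Read lock creates no wait.

```c
#include <stdbool.h>
#include <stddef.h>

enum mode { RD, WR };

/* A held or requested lock on an inclusive byte range. */
struct lock { int proc; enum mode mode; int first, last; };

#define MAXP 8   /* assumed upper bound on process numbers */

/* Two locks conflict if they belong to different processes, their
 * ranges overlap, and at least one of them is a Write lock. */
static bool conflicts(const struct lock *a, const struct lock *b)
{
    return a->proc != b->proc &&
           a->first <= b->last && b->first <= a->last &&
           (a->mode == WR || b->mode == WR);
}

/* Depth-first search for a cycle.
 * state: 0 = unvisited, 1 = on the current path, 2 = finished. */
static bool dfs(bool waits[MAXP][MAXP], int state[MAXP], int p)
{
    state[p] = 1;
    for (int q = 0; q < MAXP; q++) {
        if (!waits[p][q])
            continue;
        if (state[q] == 1)
            return true;
        if (state[q] == 0 && dfs(waits, state, q))
            return true;
    }
    state[p] = 2;
    return false;
}

/* Returns true if the blocked requests, together with the held locks,
 * form a cycle of processes that wait on each other. */
bool has_deadlock(const struct lock *held, size_t nheld,
                  const struct lock *wanted, size_t nwanted)
{
    bool waits[MAXP][MAXP] = {{false}};
    int state[MAXP] = {0};

    for (size_t i = 0; i < nwanted; i++)
        for (size_t j = 0; j < nheld; j++)
            if (conflicts(&wanted[i], &held[j]))
                waits[wanted[i].proc][held[j].proc] = true;

    for (int p = 0; p < MAXP; p++)
        if (state[p] == 0 && dfs(waits, state, p))
            return true;
    return false;
}
```

With the locks of Figure 1, the only edges are P2 waits for P3 and P3 waits for P1, so no cycle is reported. If P2 instead held a Write lock on byte 2, P3 would also wait for P2 and the check would report deadlock.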


Importantly, however, we found that the OSs 
(with NFS) do not detect deadlock across files on 
different machines, regardless of whether the pro- 
cesses are on the same or different machines. An 
application’s data must reside on the same ma- 
chine, or the data must be partitioned such that 
no transaction accesses data on more than one ma- 
chine. Otherwise, processes may hang. 


For example, suppose H1 and H2 are hosts in
a LAN, and P1 and P2 are processes that run on
H1 but access files on both H1 and H2. If P1
Write-locks a file on H1, and P2 Write-locks a file
on H2, then both processes hang if each then re-
quests a lock on the file locked by the other pro-
cess. Deadlock across files on different hosts goes
undetected, regardless of where the processes exe-
cute: the processes still block if P2 runs on H2 or
if both processes run on another host H3.


4 Lock request scheduling


A scheduler is an algorithm that manages the 
lock request wait-queue for each object. It estab- 
lishes the priority of lock requests in the queue 
with respect to each other and with respect to 
new lock requests. That is, it decides which of any 
queued requests to grant when a process releases 
locks, and it decides whether a new lock request 
conflicts with any queued requests. 


We found that the OS uses a FIFO/Reader’s- 
Priority scheduling algorithm. It processes an ob- 
ject’s wait-queue in first-in-first-out order to find 
lock requests that it can grant when a process 
releases a lock on the object. If a Read lock is 
the next grantable request in the queue, however, 
the algorithm grants all the queued Read requests, 
even those that follow Write requests in the queue. 


We also found that the OS does not consider 
an object’s wait-queue when it tests a new lock 
request for conflict; it considers only the current 
locks on the object. Consequently, new Read re- 
quests also have priority over queued Write re- 
quests. A Read request on a Read-locked object 
is always granted, even if an earlier Write lock re- 
quest is still in the queue. 
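
The observed behavior can be modeled with a small simulation. This is a sketch of the policy as described above, under the simplifying assumption of a single lockable object; it is not the OS implementation, and all names (struct object, lock_request(), lock_release()) are hypothetical.

```c
#include <stddef.h>

enum mode { RD, WR };
struct req { int proc; enum mode mode; };

#define QMAX 16

struct object {
    int readers;             /* number of Read locks currently held */
    int writer;              /* 1 if a Write lock is currently held */
    struct req queue[QMAX];  /* waiting requests, oldest first */
    size_t qlen;
};

/* A request is tested against the current locks only. */
static int grantable(const struct object *o, enum mode m)
{
    return m == RD ? !o->writer : (!o->writer && o->readers == 0);
}

static void grant(struct object *o, enum mode m)
{
    if (m == RD)
        o->readers++;
    else
        o->writer = 1;
}

static void dequeue(struct object *o, size_t i)
{
    for (; i + 1 < o->qlen; i++)
        o->queue[i] = o->queue[i + 1];
    o->qlen--;
}

/* New request: the wait-queue is never consulted, so a new Read can
 * overtake a queued Write. Returns 1 if granted, 0 if queued, and -1
 * if the queue is full. */
int lock_request(struct object *o, int proc, enum mode m)
{
    if (grantable(o, m)) {
        grant(o, m);
        return 1;
    }
    if (o->qlen == QMAX)
        return -1;
    o->queue[o->qlen].proc = proc;
    o->queue[o->qlen].mode = m;
    o->qlen++;
    return 0;
}

/* Release: scan the queue in FIFO order for the first grantable
 * request. A Write is granted alone; a Read causes every queued Read
 * to be granted, including those queued behind Writes. */
void lock_release(struct object *o, enum mode m)
{
    if (m == RD)
        o->readers--;
    else
        o->writer = 0;

    for (size_t i = 0; i < o->qlen; i++) {
        if (!grantable(o, o->queue[i].mode))
            continue;
        if (o->queue[i].mode == WR) {
            grant(o, WR);
            dequeue(o, i);
        } else {
            size_t j = i;
            while (j < o->qlen) {
                if (o->queue[j].mode == RD) {
                    grant(o, RD);
                    dequeue(o, j);
                } else {
                    j++;
                }
            }
        }
        return;
    }
}
```

In this model, a queued Write request is granted only once no Read lock remains, and every arriving Read request in the meantime is granted immediately.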


4.1 Fairness 


In any Reader’s priority scheme, the possibil- 
ity of starvation exists for transactions that re- 
quest Write locks. Moreover, we found that the 
OS does not grant any part of a lock request until 
it can grant the lock on the entire file segment. If 
a process P requests a Write lock on bytes 0-1023 
of a file f, the OS does not grant locks to P for 
any part of the segment if some other process P’ 
holds a Read or Write lock on a byte i ∈ {0..1023}.
Moreover, the OS does not give P any priority at
the bytes where there is no conflict at the time
of the request. If another process P” requests a
Write lock on another byte j ∈ {0..1023} before
P gets its lock, the OS grants the P” request, and 
P must wait for P” to terminate as well as P’. P 
could wait forever. 


The DBMS application developer must work 
around the scheduling algorithm and its fairness 
characteristics to get better performance (albeit 
suboptimal), since the scheduler itself cannot be 
changed. The developer might program P to re- 
quest its lock on the file segment one byte at a 
time, for example. The unhappy choice is between 
extra system calls in P or possible starvation. 
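
The byte-at-a-time workaround might look like the following sketch. It assumes the POSIX struct flock interface; the function name write_lock_bytewise() is hypothetical. Note that it issues one fcntl() call per byte, which is exactly the extra cost mentioned above, and that holding some bytes while waiting for others may make deadlock errors more likely.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical workaround: acquire a Write lock on the len bytes that
 * start at offset start, one byte per blocking request, so that the
 * process holds the conflict-free bytes while it waits for the rest.
 * Returns 0 on success. On failure (for example, a deadlock error)
 * returns -1 with errno set; the bytes locked so far remain locked and
 * the caller decides whether to release them. */
int write_lock_bytewise(int fd, off_t start, off_t len)
{
    struct flock fl;
    off_t i;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_len = 1;
    for (i = 0; i < len; i++) {
        fl.l_start = start + i;
        if (fcntl(fd, F_SETLKW, &fl) == -1)
            return -1;
    }
    return 0;
}
```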


4.2 Upgrading locks 


A process may request a Write lock on an ob- 
ject that it already has Read-locked. We found 
that all the OSs but Solaris1 handle lock upgrades 
in a way that supports two-phase locking: the 
scheduler places the Write request at the head of 
the wait-queue and grants it as soon as there are 
no other Readers. It replaces the Read lock with 
the Write lock before it grants any other (queued) 
Write requests. If more than one Reader requests 
an upgrade, the OS returns a deadlock error to all 
but the first request. 

In contrast, Solaris1 handles the same request
as follows: it releases the Read lock and places the 
Write request at the end of the wait-queue. Any 
queued Write requests are granted before the up- 
grade. If two Readers request an upgrade, Solaris1 
grants each in turn; it does not report deadlock. 

An upgrading process cannot tell whether an- 
other process got a Write lock during the upgrade. 
Thus, it cannot tell (without rereading the data) 
whether the data it originally read with the Read 
lock is still valid when it obtains the Write lock. 

Given the Solaris1 scheduler, it is impossible to 
implement a synchronization protocol that guar- 
antees serializable executions, unless Read locks 
are used only when there is zero chance for up- 
date. This strategy substantially reduces potential 
concurrency. 

Parenthetically, the Solaris1 policy is just as in- 
appropriate for any other OS application that is 
using the locks to maintain a cache. 


4.3 Capacity 


We found that several OSs limit the number of
lock requests that they manage at a time for local
files (Table 1; see footnote 1). The other OS capacities are vir-
tually unlimited: they handled up to 81,920 lock
requests by concurrent processes on multiple files
in our measurements.

When an OS reaches its limit, fcntl() returns
a special error code (ENOLCK) rather than granting 
or queueing new lock requests. The OS accepts 
new requests again when locks are later released. 

Solaris1, however, malfunctions when the num-
ber of lock requests exceeds its capacity. Up to the
limit, the OS grants locks and detects conflict as
it should. Beyond that, however, fcntl() returns
without error to all lock requests except for those
on bytes that were locked before the OS reached
its limit; to these it returns ENOLCK. As a result,
Solaris1 grants (i.e., does not delay or deny) con-
flicting lock requests for any bytes that were not
locked before it reached its limit. Obviously this
is a major problem.


1 In UNIX, the file system parameter FLCKREC defines the
capacity. It can be reset to a value between 100 and 2000
inclusive upon recompiling the kernel [1].


OS          Machine
RISC        MIPS Magnum
HP-UX       HP 9000/730
UNIX        AT&T Starserver E 486
Solaris1    Sun SPARC LX
Solaris3    Sun SPARC 1

[The "Maximum locks" column is missing from this copy.]

Table 1: Experimentally determined capacity of
several locking facilities

We found that the capacity of an OS comprises 
the combined total of locks currently granted and 
requests that are queued, and it is independent of 
the number of processes and files involved. More- 
over, the limit applies to the files local to the OS: 
a process that executes on host H1 and locks files
on H2 is subject to the capacity of H2 rather than
H1. Finally, the limit applies primarily to locks on
nonadjacent file segments. In Solaris3, a process 
can lock any number of file segments that are adja- 
cent to a previously-locked segment, provided that 
segment is the one among all the locked segments 
with either the highest or lowest file offset. 

The capacity of an OS affects its scheduling be- 
havior. In particular, the scheduling policy be- 
comes retry when an OS reaches its limit. A pro- 
cess whose lock request returns with ENOLCK must 
resubmit the request in order to obtain the lock; 
the scheduler does not absorb the request in a 
wait-queue and apply it later. 

Importantly, processes can become live-locked 
under a retry policy. If two processes each request 
a lock that the other process holds without releas- 
ing any that the other process needs, then they will 
retry forever. If they quit and restart, they may 
generate the same live-lock situation again. In ei- 
ther case, the potential exists for their making no 
forward progress. 
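
Because the retry policy places the burden on the application, a process would normally bound its retries rather than loop forever. The following is a minimal sketch under that assumption; the function name setlk_retry() and its parameters are hypothetical. It retries on ENOLCK and also on EACCES or EAGAIN, the errors that POSIX allows F_SETLK to report for a conflicting lock.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical bounded retry around a nonblocking lock request.
 * Tries the request described by *fl up to max_tries times, sleeping
 * delay_sec seconds between attempts. Returns 0 once the lock is
 * granted. Returns -1 with errno set if the request fails for a reason
 * that a retry cannot fix, or if every attempt fails; the caller can
 * then release its other locks or abort instead of spinning. */
int setlk_retry(int fd, struct flock *fl, int max_tries, unsigned delay_sec)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        if (fcntl(fd, F_SETLK, fl) == 0)
            return 0;
        if (errno != ENOLCK && errno != EACCES && errno != EAGAIN)
            return -1;
        if (i + 1 < max_tries)
            sleep(delay_sec);
    }
    return -1;
}
```

Bounding the retries does not remove the live-lock risk described above, but it gives each process a point at which it can give up the locks it holds.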


4.4 Priority 


A priority-based scheduling algorithm is one 
that recognizes an application-defined ordering 
over a set of transactions. It uses the ordering to
prioritize the wait-queue and to revoke locks. Pri-
ority scheduling is important in real-time DBMSs
[8] or in any application where the processes have
unequal significance.

The OS supports priority-based scheduling, but 
only in a very limited sense. A process whose lock 
request is denied due to conflict can discover the 
identifiers of processes with conflicting locks; it 
can then send signals to kill lower priority pro- 
cesses. Only in a single-machine environment is 
the mechanism even possible, however, since pro- 
cess identifiers contain no information about the 
host machines. Moreover, processes waiting in the 
queue cannot be identified or reordered; it is only 
possible to preempt lock holders. 
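
The discovery step can be sketched with F_GETLK. This is an illustration that assumes the POSIX struct flock fields, including l_pid; the function name conflicting_holder() is hypothetical. As noted above, the returned identifier is meaningful only when the lock holder runs on the same machine.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report which process, if any, holds a lock that
 * would conflict with a lock of the given type on len bytes starting
 * at offset start. Returns the holder's process identifier, 0 if no
 * conflicting lock exists, or (pid_t)-1 if the query itself fails. */
pid_t conflicting_holder(int fd, short type, off_t start, off_t len)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return (pid_t)-1;
    return fl.l_type == F_UNLCK ? 0 : fl.l_pid;
}
```

A higher-priority process could pass a nonzero result to kill() to preempt the holder. Locks held by the calling process itself are never reported, which is also why the call cannot tell a process what it already holds.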

The ability to reorder the wait-queue would be 
very useful in deadlock resolution. Currently, the 
OS resolves deadlock by refusing the latest lock 
request, a reasonable policy when the detection 
algorithm runs every time a lock request is about 
to be queued and all processes have equal priority 
and capabilities. When equality is not the case, 
however, the optimal resolution strategy is likely 
to involve dequeuing a specific process, rather than 
the one that happens to be last. 


5 Lockable entities 


The OS facility supports locking at arbitrarily 
fine granularity, and as a result, it can provide sup- 
port for transaction synchronization. Earlier stud- 
ies criticized prior OS locking schemes for their 
coarse granularity [10]. In addition, the facility 
integrates locking at the level of records, pages, 
and files under a uniform interface. 

The facility has a problem, however, in that it 
associates locks with open file descriptors. A pro- 
cess must open a file to lock it, and if it closes the 
file, the OS automatically releases any locks that 
the process holds on the file. Typically the num- 
ber of open file descriptors per process is limited, 
and a transaction may have to access and lock a 
larger number of files. 

The hard limit on open file descriptors varies 
in the systems we tested from 256 in SunOS to 
2500 in IRIX, with Solaris3, HP-UX, and UNIX 
having a limit of 1024. In most database appli- 
cations, a process is unlikely to lock more than 
2500 files. Some applications, however, will easily 
exceed 256 open file descriptors. For example, a 
process that creates a transaction-consistent copy 
of the database as a backup must open and lock 
all the database files. In a DBMS that supports 
horizontal partitioning (i.e., separating a relation’s
records into different files according to their at-
tribute values), the number of files could be large.
One application within our company partitions its 
data into over 800 files. 

When it is possible for some transaction to ex- 
ceed the limit, the synchronization protocol must 
resort to using lock files or a listing file l that con-
tains an entry for every file in the database. A
transaction T that holds a lock on file f must cre-
ate a lock file for f or lock the entry for f in l
before it closes f. To do so, T must first obtain
a lock on all of f, even if it has only one byte
locked. Moreover, for every file f that T locks,
T must test for the existence of a lock file or a
lock on the entry in l before it requests its first lock on
f. Both options substantially reduce concurrency, 
and they increase complexity and overhead with 
extra file maintenance. 


6 Downgrading locks 


None of the OSs provide a way to prevent a 
transaction from inadvertently downgrading locks 
and violating serializability. If a transaction locks 
a record in Write mode, and later requests a Read 
lock on the same record, the OS grants the weaker 
lock and releases the stronger one. 

The situation most commonly arises when a 
transaction queries several indices and obtains the 
same record multiple times. For example, a trans-
action T queries a relation R via the color index,
and the query returns a record r. T gets a Write
lock on r and changes its color. T then queries R
via the size index, and it again receives r. T does
not change r in this case, so it only gets a Read 
lock on r, inadvertently downgrading the Write 
lock it already has. 

A process cannot find out from the OS what 
locks it holds (e.g., before a request for a Read 
lock) to avoid downgrading Write locks. Even if it 
could, however, the extra system calls that the ap- 
proach requires reduce its viability. Transactions 
must either request locks in a way that avoids the 
risk of downgrading, or they must remember which 
file segments they have Write-locked. Neither ap- 
proach is entirely satisfactory. Transactions can 
avoid downgrading if they delay all Write lock 
requests until after all Read-locking is complete 
(e.g., just before commit), but they run a higher 
risk of deadlock. If they keep track of their Write 
locks, they incur more overhead, and they dupli- 
cate information that the OS already stores. 

The downgrading problem can be solved easily
if the OS simply adds a new lock mode, Read’. A
request by transaction T for a Read’ lock on an
object o is the same as a request for a Read lock
on o, except that if T already holds a Write lock
on o, the Read’ request does not change it.
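
Until such a mode exists, a transaction can approximate Read’ in user space by remembering its Write-locked segments, as discussed above. The sketch below is one hypothetical way to do so; the names write_lock() and read_prime_lock() and the fixed-size table are assumptions. It handles only requests that fall entirely inside one recorded Write-locked segment, it assumes len > 0, and it omits the bookkeeping needed when locks are released or the file is closed.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_WLOCKS 64

/* One Write-locked segment that this process remembers. */
struct wrange { int fd; off_t start; off_t len; };

static struct wrange wlocks[MAX_WLOCKS];
static int nwlocks;

static int setlkw(int fd, short type, off_t start, off_t len)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLKW, &fl);
}

/* Obtain a Write lock and record the segment. */
int write_lock(int fd, off_t start, off_t len)
{
    if (nwlocks == MAX_WLOCKS) {
        errno = ENOLCK;
        return -1;
    }
    if (setlkw(fd, F_WRLCK, start, len) == -1)
        return -1;
    wlocks[nwlocks].fd = fd;
    wlocks[nwlocks].start = start;
    wlocks[nwlocks].len = len;
    nwlocks++;
    return 0;
}

/* Emulated Read': if the segment is already covered by a recorded
 * Write lock, leave that lock untouched; otherwise request a Read lock. */
int read_prime_lock(int fd, off_t start, off_t len)
{
    int i;

    for (i = 0; i < nwlocks; i++) {
        const struct wrange *w = &wlocks[i];
        if (w->fd == fd && w->start <= start &&
            start + len <= w->start + w->len)
            return 0;
    }
    return setlkw(fd, F_RDLCK, start, len);
}
```

This duplicates information that the OS already stores, which is the overhead the text objects to.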


7 Performance 


We completed several performance studies that 
measure the effect of the following variables on the 
time required for a process to obtain locks on a file: 


Lock type: getting Read vs. Write locks 

Fragmentation: locking contiguous vs. noncon- 
tiguous bytes in a file 

Access order: locking noncontiguous bytes in 
ascending order of file location vs. descend- 
ing order vs. random order 

File location: locking local vs. remote files 

Number of files: getting n locks on one file vs. 
ten files (n/10 locks on each) 

Concurrency: one vs. ten concurrent processes 
each getting n locks on the same file 


We performed the studies on the following OSs: 


SunOS: SunOS 4.1.1 on a Sun SPARC 1+ 
Solaris3: Solaris 2.3 on a Sun SPARC 1 
IRIX: IRIX 5.1.1.2 on a SGI Indigo 


The reference for each measurement is a single 
process that requests Read locks from the OS on 
noncontiguous bytes in a single local file (i.e., every 
second byte in ascending order of file location). 
The test process differs from the reference process 
in each case only in the variable being studied. 

We also ran the reference process on the follow- 
ing systems to compare the OS locking services to 
a dedicated lock server that executes in user space: 


SunOS2: SunOS 4.1.1 on a Sun SPARC 2 
EXODUS: Exodus storage manager server 2.2 
on SunOS 4.1.1 and a Sun SPARC 2 


Comparisons that involve remote locking were 
conducted overnight, and for the local locking ex- 
periments, the machine was not shared. In both 
cases, the network load was minimal. We used the 
time() command to measure the system and real 
times of process execution. 

We describe the experimental results in the next 
section. Because the study is a black-box analysis, 
the interpretation of the results is not as specific 
as an interpretation based on the actual OS code 
would be. Throughout the discussion, however, we 
speculate on the nature of the data structure that 
the OS uses to represent the locks (i.e., the locking 
structure). The locking structure for a file must
represent the ranges of bytes that are locked, by
which process(es), and in what mode. When the
OS receives a lock request for a segment in file f, it
must search f’s locking structure to detect conflict
with any locks held by other processes on overlap-
ping segments. If there is no conflict, it enters the
lock information in the locking structure and re-
turns. Otherwise, it runs the deadlock detection
algorithm and either queues or denies the request.


Figure 2: IRIX comparison. The average system time per lock request of the reference process is compared
to that of the following (%age of the reference time at 8192 lock requests): (1) a process that gets Write
locks (no difference), (3) a process that locks ten files (one-tenth of the locks on each) (6.6%), (5) a process
that locks contiguous bytes (0.6%), (4) one that locks noncontiguous bytes in descending offset order (0.5%),
and (2) one that locks noncontiguous bytes in random order (23.4%).


7.1 Results 


The most marked and consistent results are the 
following: locking contiguous bytes is faster than 
locking noncontiguous bytes, locking local files is 
faster than locking remote files, obtaining n locks 
on one file is slower than obtaining n locks on ten 
files (n/10 locks on each), and the time per lock 
request is smaller when one process obtains n locks 
on a file than when ten concurrent processes each 
obtain n locks on the same file. The effect of the 
other two variables, lock type and access order, 
differs among the OSs. 

To compare processes that are on the same
machine as the files they lock, we calculate the
average lock request system time (ST_n - ST_0)/n,
where ST_n is the system time for a process to ob-
tain n locks, and for the other comparisons, we cal-
culate the average lock request real (elapsed) time
(RT_n - RT_0)/n, where RT_n is the real time for a
process to obtain n locks.

We first discuss the variables that reduce the 
lock request time or have no effect, and then we 
consider concurrency and remote locking. We ac- 
company the textual description of the results with 
several figures. For each figure, we encourage the 
reader to first locate the data for the reference pro- 
cess (perhaps marking it with a colored pen) and 
then consider the other data in relation to the ref- 
erence. 


Fragmentation 


We compare the average lock request system 
times of a reference process R to those of another 
process Pc that executes on the same machine at
a separate time. Pc gets Read locks on contiguous
bytes in a single local file (i.e., every byte in or-
der of file location, such that the region it locks is
not fragmented), whereas R locks noncontiguous
bytes. The results are the same whether Pc re-
quests the locks in ascending or descending order
(Figures 2-4). 

Figure 3: SunOS comparison. The average system time per lock request of the reference process is compared
to that of the following (%age of the reference time at 8192 lock requests): (1) a process that gets Write
locks (149%), (4) a process that locks ten files (one-tenth of the locks on each) (12.7%), (5) a process that
locks contiguous bytes (0.6%), (3) one that locks noncontiguous bytes in descending offset order (68.1%),
and (2) one that locks noncontiguous bytes in random order (84.0%).


In the capacity measurements for Solaris3 (Sec-
tion 4.3), we observed that a process that requests
locks on contiguous bytes in either ascending or
descending order can obtain an unlimited number
of locks. When the bytes are not contiguous, how- 
ever, the process obtains no more than 510 locks. 
This suggests that the locking structure entry for 
a locked file segment contains fields for first-byte 
and last-byte that are decremented or incremented 
when the process locks an adjacent segment. In- 
deed, a process that initially obtains 509 locks on 
a range of noncontiguous bytes can subsequently 
obtain an unlimited number of locks on bytes that 
are adjacent to either end of the range. If the 
process instead requests locks on some (unlocked) 
bytes in the midst of the range, however, it can 
obtain only one more lock. This suggests that the 
OS modifies the first- and last-byte fields only in 
the locking structure’s endpoint entries. It does 
not modify or merge entries in the middle. 
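
The structure that these observations suggest can be sketched as a sorted table of byte ranges. Everything below is speculation in the same spirit as the text: the names are hypothetical, the table holds the locks of a single process on a single file, and the capacity of 510 entries is taken from the Solaris3 measurement. Only the two endpoint entries are ever extended; a byte in the middle always needs a new entry.

```c
#include <stdbool.h>

#define CAPACITY 510   /* entry limit observed for Solaris3 */

struct entry { long first, last; };   /* inclusive byte range */

struct lockstruct {
    struct entry e[CAPACITY];         /* kept sorted by file offset */
    int n;
};

/* Record a lock on byte b (assumed not to be locked already).
 * Returns true if the lock fits, false if the table is full. */
bool lock_byte(struct lockstruct *s, long b)
{
    int i, pos;

    /* Adjacent to the lowest entry: decrement its first-byte field. */
    if (s->n > 0 && b == s->e[0].first - 1) {
        s->e[0].first--;
        return true;
    }
    /* Adjacent to the highest entry: increment its last-byte field. */
    if (s->n > 0 && b == s->e[s->n - 1].last + 1) {
        s->e[s->n - 1].last++;
        return true;
    }
    if (s->n == CAPACITY)
        return false;

    /* Otherwise search from the low end and insert a new entry. */
    pos = 0;
    while (pos < s->n && s->e[pos].first < b)
        pos++;
    for (i = s->n; i > pos; i--)
        s->e[i] = s->e[i - 1];
    s->e[pos].first = b;
    s->e[pos].last = b;
    s->n++;
    return true;
}
```

Under this model, 509 locks on every second byte fill 509 entries, any number of further locks adjacent to either end succeed, and exactly one more lock in the middle of the range fits before the table is full.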

A locking structure such as this explains the
marked improvement in Pc’s lock request timings
as compared to R’s. In particular, a lock request
on a file f by Pc is simply two context switches
(user → kernel and kernel → user) and an in-
crement or decrement operation. Indeed, Pc’s
lock request timings represent a reasonable up-
per bound for the context-switching overhead in
fcntl(), since the computation on the locking
structure is minimal.


Access order


We compare the average lock request system 
times of a reference process R to those of processes 
Pd and Pr, which execute on the same machine as
R at separate times. Like R, Pd and Pr get Read
locks on noncontiguous bytes in a single local file
(i.e., every second byte), but R locks bytes in as-
cending order of file location, while Pd and Pr
lock bytes in descending and random order respec- 
tively. We generated randomly ordered, noncon- 
tiguous bytes by rounding each number generated 
by rand() to an even integer. 

In both IRIX and SunOS, Pd and Pr are much
faster than R (Figures 2 and 3), although the magni-
tude of the difference is greater in IRIX: Pd’s times
in IRIX are comparable to those of a process that 
locks contiguous bytes. In Solaris3, however, the 
effect of access order on the lock request time is 
not significant for the number of lock requests that 
are within the OS capacity (Figure 4). 

The results are consistent with a locking struc-
ture whose entries are ordered by file location, and
an access algorithm that does a search and inser-
tion for each lock request; the search locates the
point in the locking structure where the entry for
the lock is inserted. Given such a locking struc-
ture, Pd and Pr are both faster than R due to a
shorter search path.


Figure 4: Solaris3 comparison. The average system time per lock request of the reference process is not
markedly different from that of a process that gets (1) randomly ordered locks, (2) Write locks, or (3) locks
in descending offset order, but it is greater than that of (4) a process that locks ten files (one-tenth of the
locks on each) (44.8% of the reference time at 510 requests) or (5) a process that locks contiguous bytes
(26.5% of the reference time at 510 requests). When ten concurrent instances of the reference process request
locks on the same file (6), the average request time is 190% of the single reference time at 175 locks/process
(with the largest standard deviation being 0.85 at 25 locks). The processes could obtain no more than 175
locks apiece without exceeding the OS capacity.


Lock type 


We compare the average lock request system 
times of a reference process R to those of another 
process Pw that executes on the same machine at 
a separate time. Pw gets Write locks on noncon- 
tiguous bytes in a single local file (i.e., every second 
byte in ascending order of file location), whereas 
R gets Read locks. 

R is faster than Pw in SunOS, but there is not 
a marked difference between them in either IRIX 
or Solaris3 (Figures 2-4). It is unclear why Write 
locks should be more costly in SunOS. 


Number of files 


We compare the average lock request system
times of a reference process R to those of another
process Pf10 that executes on the same machine
at a separate time. Like R, Pf10 gets Read locks
on noncontiguous bytes (i.e., every second byte in
ascending order of file location), but R requests
n locks on a single local file, while Pf10 requests
n/10 locks on each of ten local files.

Pf10 is faster than R on IRIX, SunOS, and So-
laris3 (Figures 2-4). It is likely that the OS main-
tains separate locking structures for different files, 
since locks on different files do not conflict. Main- 
taining separate locking structures yields a perfor- 
mance advantage by reducing the search path for 
each lock request. 

The improvement is not linear in the number of 
files, however. In IRIX, the average time at 8192
lock requests for Pf10 is 6.6% of R, whereas the
average time for a process Pf20 that divides the
locks among 20 files is 4.2% of R. In SunOS, Pf10
and Pf20 are 12.7% and 8.2% of R respectively.

The average lock request time increases with 
the current total number of lock requests granted 
and managed by the OS. In IRIX, for example, the 
average request time for Pf10 is 9.57 milliseconds
when the system total is 81920 locks and only 7.92
milliseconds when it is 8192.


Figure 5: Locking a remote SunOS file from IRIX, Solaris3, and SunOS. The average real time per lock
request of a reference process that locks a local file under SunOS is less than that of a remote process that
locks the same file. The average time at 8192 lock requests as a percentage of the local process time is 652%
for a remote SunOS process, 454% for a Solaris3 process, and 253% for an IRIX process.


The same relationship holds in SunOS, al- 
though we could not measure the average request 
time for Pf10 at a system total of 81920 locks.
On each of three attempts to get the value, the 
workstation crashed. The lock-requesting process 
ran for approximately 80 minutes real time and 
acquired between 76000 and 77000 locks. 

Clearly, the average lock request time is affected 
not only by the search time but also by the in- 
sertion time. As the locking structures grow and 
use more space, it is reasonable to expect that 
the memory management will become more time- 
consuming. 


File location 


We compare the average lock request real times 
of a reference process R that locks a local file f 
under SunOS to those of processes Pi, Ps3, and
Ps. Like R, they get Read locks on noncontiguous
bytes in f (i.e., every second byte in ascending
order of file location), but they execute remotely 
at separate times under IRIX, Solaris3, and SunOS 
respectively. 

R executes much more quickly than do the re-
mote processes (Figure 5). Their average times
at 8192 lock requests as a percentage of R’s time
are 253% for Pi, 454% for Ps3, and 652% for Ps.


The results are explained in part by the additional 
context switches and interprocess communication 
associated with the NFS remote locking implemen-
tation (i.e., lockd; see Section 2.1).

To approximate this overhead, we ran Pc, a
process that Read-locks contiguous bytes in f,
on each of the remote OSs. Recall that lock requests for
contiguous bytes require so little computation that
their execution times approximate context switch-
ing alone (Figures 2-4). At 8192 lock requests,
the average times for Pc on IRIX, Solaris3, and
SunOS are 19.9% of Pi’s time, 22.6% of Ps3’s time,
and 7.89% of Ps’s time respectively.

The overhead approximation, however, does not 
entirely account for the request time differences 
between R and the remote processes. In each OS, 
the average real time for the remote process (i.e.,
Pi, Ps3, or Ps) is greater than the sum of the over-
head (i.e., Pc) and the local locking time (i.e.,
R). More experimentation is needed to account 
for the difference. 


Concurrency 


We compare the average lock request system
times of a single reference process R to those of ten
concurrent instances of the reference process that
execute on the same machine as R at a separate
time and obtain locks on the same file.


Figure 6: Concurrency in IRIX and SunOS. The average system time per lock request per process of a single
reference process is less than that of ten concurrent instances of the reference process that request locks on
the same file. The average time of the concurrent processes in IRIX is 1080% of the single reference time
at 8192 locks/process (with the largest standard deviation being 1.04 at 4096 locks) and 939% in SunOS at
5000 locks/process (with the largest standard deviation being 4.56 at 500 locks).

R executes much more quickly than do the con-
current processes (Figure 6). In IRIX, the av-
erage time at 8192 lock requests/process is ten
times that of R, and in SunOS, the average time
at 5000 lock requests/process is nine times that of
R. Consequently, the real (elapsed) time for the
ten processes is approximately 100 times that of
the single process. We could not measure the ef- 
fect of concurrency in Solaris3 to the same extent 
due to its limited locking capacity, but the results 
suggest the same trend (Figure 4). 

When the ten concurrent processes each lock a 
different file rather than the same file, the average 
request times are close to those of a single process 
that executes alone. The OS handles the same 
number of lock requests, but they are spread across 
more files, and hence the locking structure for each 
file is smaller. At 8192 lock requests/process, the 
average request time is 103% of R’s time in IRIX 
and 101% of R’s time in SunOS. 

We also compared the average lock request sys- 
tem times of R to those of ten concurrent pro- 
cesses, where nine processes request Read locks on 
the even-numbered bytes in a file, and one requests 
Write locks on the odd-numbered bytes. The dif-
ference in request times between the ten processes
with all Readers and the ten processes with one 
Writer was negligible, although in SunOS, the pro- 
cess getting the Write locks consistently finished 
first. 


Dedicated lock server 


We compare the average lock request real times
of the reference process R to those of another pro-
cess Pe that executes on the same machine at a
separate time. Pe obtains Read locks from a ded-
icated lock server in EXODUS, a DBMS that ex-
ecutes in user space [4], whereas R obtains Read
locks from SunOS2.

EXODUS provides automatic page-level lock-
ing for objects that a transaction reads or modi-
fies. To isolate the cost associated with just the
locking mechanism, we used an internal function
that acquires a file lock from EXODUS without
reading in any objects (see footnote 2).

Pe gets locks faster from EXODUS than R gets
them from SunOS2 (Figure 7). R requests locks
on the same file, however, while Pe requests locks
on different files. Consequently, the lock alloca-
tion algorithm in SunOS2 iterates over a locking
structure that grows with each new lock request,
whereas EXODUS simply adds an entry to the
locking structure: Pe locks a different file with
each request.


2 The (undocumented) function, rpc_LockFile(), was
suggested by Mike Zwilling at the University of Wisconsin.


Figure 7: A dedicated lock server on SunOS2. The
average real time per lock request of a reference
process on SunOS2 is greater than that of a pro-
cess that obtains Read locks from EXODUS, but
the EXODUS process is not as fast as a SunOS
process that locks contiguous bytes. The average
time at 8192 lock requests as a percentage of the
reference time is 3.7% for the EXODUS process
and 0.8% for the other.

A fairer comparison would have included a 
SunOS2 process that gets one lock on each of n dif- 
ferent files. Due to the limitation on the number of 
open file descriptors, however, we could not gather 
a sufficient amount of data. So instead we include 
a SunOS process that locks contiguous bytes. Like 
P_E, its requests do not require a locking structure
search. This process obtains locks faster than P_E.


7.2 Assessment 


The performance profile of the OS locking ser- 
vice must be improved, especially with regard to 
sequential access, concurrent processing, and re- 
mote file locking, if it is to provide adequate sup- 
port for transaction synchronization in a DBMS. 

Specifically, the lock allocation algorithm 
should handle sequential as well as random access 
to the records in a file gracefully. Transactions fre- 
quently use sequential access to get better paging 
dynamics and to reduce deadlock (i.e., a common 
strategy for deadlock avoidance is to define an ac- 
cess order for data, and for the data within a file, 
the usual ordering is by location). 


Further, a DBMS typically supports many con- 
current processes that access the same data. One 
expects the elapsed time for n concurrent pro- 
cesses to be less than (or at least not much greater 
than) their sequential execution time (i.e., n-times 
the average single process time). A performance 
profile in which the elapsed time is n^2-times the
single process time is unacceptable for transaction 
synchronization. 

Finally, the remote file locking service under 
NFS is very slow, and given the absence of dead- 
lock detection across files on different servers, its 
usefulness is extremely limited. 


8 Amenities 


While using the OS locking facility, we identi- 
fied several additional capabilities that would im- 
prove performance and convenience of use: time- 
outs, the ability to request a set of locks, and a 
transfer function. 


Timeouts 


A process advances in the wait-queue for a lock 
only while it blocks. For critical processes, how- 
ever, we usually want to bound the duration of 
blocking with a timeout. Indeed, timeouts are re- 
quired in a distributed setting where processes ac- 
cess both local and remote files, because the OS 
(with NFS) does not detect deadlock across file 
systems. A developer can program a timeout us-
ing alarm(), but as a convenience, the OS should
provide a timeout parameter for fcntl() as it does
for select().


Requesting a set of locks 


A single lock request covers a single contigu-
ous region in a single file. For efficiency, the OS
should provide a function vfcntl() that gets locks
on a vector (set) of regions in the same or different
files. When invoked with blocking, vfcntl() re-
turns when it gets all the locks. Without blocking,
it returns immediately and indicates which locks
it got (a la select()).

In addition, vfcntl() could provide an al-
ternative fairness model for scheduling. In par-
ticular, a process that locks a region r using
vfcntl() with r defined as a set of bytes (i.e.,
vfcntl({bytes in r},...)) gets the lock on
each byte as soon as it is available.
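
To make the shape of the proposed interface concrete, the following sketch emulates the non-blocking form of vfcntl() in user space. The struct lock_req type and the function name vfcntl_try() are assumptions made for illustration only; the proposal does not define a signature. Because the emulation issues one fcntl(F_SETLK) call per region, it shows the calling convention but not the efficiency gain that a single system call would provide.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* One region to lock. This layout is hypothetical. */
struct lock_req {
    int   fd;       /* open file descriptor */
    short type;     /* F_RDLCK or F_WRLCK */
    off_t start;    /* first byte of the region */
    off_t len;      /* length of the region in bytes */
    int   granted;  /* output: 1 if the lock was obtained, else 0 */
};

/*
 * Attempts every request without blocking, records which ones were
 * granted (in the spirit of select()), and returns the number granted.
 */
int vfcntl_try(struct lock_req *reqs, int n)
{
    int i, got = 0;

    assert(reqs != NULL && n >= 0);
    for (i = 0; i < n; i++) {
        struct flock fl;

        memset(&fl, 0, sizeof fl);
        fl.l_type = reqs[i].type;
        fl.l_whence = SEEK_SET;
        fl.l_start = reqs[i].start;
        fl.l_len = reqs[i].len;

        /* F_SETLK returns immediately instead of queueing the request. */
        reqs[i].granted = (fcntl(reqs[i].fd, F_SETLK, &fl) == 0);
        got += reqs[i].granted;
    }
    return got;
}
```

Passing one request per byte of a region approximates the alternative fairness model described above, since each byte is locked as soon as it is free rather than only when the whole region is.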


Transfer function 


We strongly suggest a function through which 
one process transfers (a subset of) its locks to an- 
other. The transfer function is applicable to fault
tolerance (e.g., a sick process transfers its locks to
a replacement process before exiting), scheduling
(e.g., instead of blocking for a lock request, a pro- 
cess spawns a child, and the child blocks; when it 
gets the lock, the child transfers it to the parent 
process), and nested transactions (e.g., parent and 
child processes transfer locks according to a nested 
transaction protocol) [7]. 


9 Discussion 


The OS provides a locking facility with which 
a DBMS developer can implement a synchroniza- 
tion protocol based on Read/Write locks. It is cur- 
rently useful for applications whose transactions 
have the properties shown in Table 2. 

While the OS facility has several advantages, 
such as openness, decreasing the time to imple- 
ment a DBMS, and reducing the resulting system 
size, it also has several deficiencies that limit its 
applicability. 


9.1 Requirements 


The following are required areas for improve- 
ment: 


• Two-phase locking
The OS must provide better support for two- 
phase locking, the most common protocol for 
producing serializable executions. The fa- 
cility must handle both the upgrading and 
downgrading of locks as well as the closing 
of files without compromising correctness. 


The Solaris1 protocol for upgrades must not
be used in any OS, since it is inappropri- 
ate for both transaction synchronization and 
cache coherence algorithms. The downgrad- 
ing problem is handled simply by adding the 
new lock mode Read’. 


• Performance
The performance profile must be improved to 
better support sequential access, concurrent 
processing, and remote file locking. 


• Locking capacity
The OS locking capacity must be increased in 
those systems that limit it. Even the largest 
of the limited capacities, a maximum of 510 
in Solaris3, is too small for a DBMS. The 
scheduling policy in these cases degrades too 
soon to timeout and retry with the associated 
risk of live-lock. 


An estimate for the number of locks required
for the database benchmark TPC/A is 1800
on a DECstation 5000 class machine (at a 30
second user response time and 10 transactions
per second) [9].


• Deadlock detection
The OS should detect deadlock across files on 
different machines to support location inde- 
pendence of data and to reduce the need for 
timeout and retry strategies. 


• Documentation

The locking facility is inadequately docu- 
mented. A developer cannot determine from 
the documentation whether a synchronization 
protocol based on the facility guarantees se- 
rializability. Moreover, the documentation is 
incorrect in some cases. SunOS, for exam- 
ple, describes its upgrade protocol to be the 
one we found in Solaris1. In reality, however, 
SunOS implements a different protocol that 
does support two-phase locking. 


We also suggest more flexibility and extensibil- 
ity in the following areas: 


• Fault Tolerance/Recovery
The facility should provide support for recov-
ery from file server failures. It should allow a
process to transfer locks.


• Scheduling
The facility should allow customization of the
scheduler to support priority-based or seman-
tic scheduling algorithms. Parameters should
be added to fcntl() for timeout and sets of
locks.


• Nested Transactions
The facility should support the definition of 
protocols for parent and child processes that 
differ from those that cover unrelated pro- 
cesses to allow a simple implementation of 
nested transactions. 


9.2 Further study 


A performance comparison with a commercial 
DBMS lock manager would be useful and enlight- 
ening. Unfortunately, it requires access to an in- 
ternal, nonpublic interface in the DBMS, and as 
such, it requires the cooperation of the vendor. To 
date, we have not found a vendor interested in the 
comparison. 


9.3 Conclusion


In conclusion, we are still very interested in the
prospect of OS support for DBMSs. There is a
trend in OS research toward microkernel archi-
tectures that could allow the kind of customiza-
tion and flexibility that DBMSs need. Moreover,
there is interest in the OS research community in
providing better support for applications. Finally,
many of the problems we cite have solutions that
are backward-compatible with the current inter-
face (e.g., the Read’ lock).


Table 2: Characteristics of transactions that are well-suited to OS locking

Property:  They primarily lock files on the same machine.
Rationale: Local locking is less costly than remote; deadlock is not
           detected across file systems.

Property:  They are mostly readers; writers are rare.
Rationale: The scheduler is Reader’s priority (caveat: writers to
           shared data may wait a long time).

Property:  The segments locked by different transactions do not overlap.
Rationale: Starvation is possible when a transaction requests a
           lock on a file segment, part of which is locked in a
           conflicting mode by another transaction.

Property:  They lock file segments that are contiguous or randomly ordered.
Rationale: Contiguous locking is inexpensive, and so is randomly
           ordered locking in some OSs.

Property:  Each transaction locks fewer files than its maximum
           number of open file descriptors.
Rationale: The OS releases a transaction’s locks on a file upon
           closing.

Property:  All transactions have equal priority.
Rationale: The scheduler does not support priority-based schemes.


Abstract


This paper presents a comprehensive design over- 
view of the SunOS 5.4 kernel memory allocator. 
This allocator is based on a set of object-caching 
primitives that reduce the cost of allocating complex 
objects by retaining their state between uses. These 
same primitives prove equally effective for manag- 
ing stateless memory (e.g. data pages and temporary 
buffers) because they are space-efficient and fast. 
The allocator’s object caches respond dynamically 
to global memory pressure, and employ an object- 
coloring scheme that improves the system’s overall 
cache utilization and bus balance. The allocator 
also has several statistical and debugging features 
that can detect a wide range of problems throughout 
the system. 


1. Introduction 


The allocation and freeing of objects are among the 
most common operations in the kernel. A fast ker- 
nel memory allocator is therefore essential. How- 
ever, in many cases the cost of initializing and 
destroying the object exceeds the cost of allocating 
and freeing memory for it. Thus, while improve- 
ments in the allocator are beneficial, even greater 
gains can be achieved by caching frequently used 
objects so that their basic structure is preserved 
between uses. 


The paper begins with a discussion of object 
caching, since the interface that this requires will 
shape the rest of the allocator. The next section 
then describes the implementation in detail. Section 
4 describes the effect of buffer address distribution 
on the system’s overall cache utilization and bus 
balance, and shows how a simple coloring scheme 
can improve both. Section 5 compares the 
allocator’s performance to several other well-known 
kernel memory allocators and finds that it is
generally superior in both space and time. Finally,
Section 6 describes the allocator’s debugging
features, which can detect a wide variety of prob-
lems throughout the system.


1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA


2. Object Caching 


Object caching is a technique for dealing with 
objects that are frequently allocated and freed. The 
idea is to preserve the invariant portion of an 
object’s initial state — its constructed state — 
between uses, so it does not have to be destroyed 
and recreated every time the object is used. For 
example, an object containing a mutex only needs
to have mutex_init() applied once, the first
time the object is allocated. The object can then be
freed and reallocated many times without incurring
the expense of mutex_destroy() and
mutex_init() each time. An object’s embedded
locks, condition variables, reference counts, lists of 
other objects, and read-only data all generally qual- 
ify as constructed state. 


Caching is important because the cost of con- 
structing an object can be significantly higher than 
the cost of allocating memory for it. For example, 
on a SPARCstation-2 running a SunOS 5.4 develop- 
ment kernel, the allocator presented here reduced 
the cost of allocating and freeing a stream head 
from 33 microseconds to 5.7 microseconds. As the 
table below illustrates, most of the savings was due 
to object caching: 


allocator | construction + destruction | allocation | init.
new       | 0.0                        | 3.8        | 1.9


Caching is particularly beneficial in a mul-
tithreaded environment, where many of the most
frequently allocated objects contain one or more
embedded locks, condition variables, and other con-
structible state.


The design of an object cache is 
straightforward: 


To allocate an object:

    if (there’s an object in the cache)
        take it (no construction required);
    else {
        allocate memory;
        construct the object;
    }

To free an object:

    return it to the cache (no destruction required);

To reclaim memory from the cache:

    take some objects from the cache;
    destroy the objects;
    free the underlying memory;


An object’s constructed state must be initial- 
ized only once — when the object is first brought 
into the cache. Once the cache is populated, allo- 
cating and freeing objects are fast, trivial operations. 
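
The three operations above can be tried out in user space with a short sketch. Everything here is an illustrative assumption rather than the kernel implementation: the names obj_cache, cache_alloc(), cache_free(), and cache_reap(), the fixed capacity, and the use of malloc() as the underlying memory source. The counting constructor and destructor exist only to show how rarely they run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_CAP 64    /* arbitrary capacity chosen for this sketch */

struct obj_cache {
    size_t size;                      /* object size in bytes */
    void (*ctor)(void *, size_t);     /* one-time construction */
    void (*dtor)(void *, size_t);     /* undoes the constructor */
    void  *free_objs[CACHE_CAP];      /* constructed, currently unused objects */
    size_t nfree;
};

void cache_init(struct obj_cache *c, size_t size,
                void (*ctor)(void *, size_t), void (*dtor)(void *, size_t))
{
    assert(c != NULL && size > 0);
    c->size = size;
    c->ctor = ctor;
    c->dtor = dtor;
    c->nfree = 0;
}

void *cache_alloc(struct obj_cache *c)
{
    void *obj;

    if (c->nfree > 0)                     /* take it: no construction required */
        return c->free_objs[--c->nfree];
    obj = malloc(c->size);                /* allocate memory */
    if (obj != NULL && c->ctor != NULL)
        c->ctor(obj, c->size);            /* construct the object */
    return obj;
}

void cache_free(struct obj_cache *c, void *obj)
{
    if (c->nfree < CACHE_CAP) {           /* return it: no destruction required */
        c->free_objs[c->nfree++] = obj;
        return;
    }
    if (c->dtor != NULL)                  /* cache is full: release for real */
        c->dtor(obj, c->size);
    free(obj);
}

void cache_reap(struct obj_cache *c)      /* reclaim memory from the cache */
{
    while (c->nfree > 0) {
        void *obj = c->free_objs[--c->nfree];
        if (c->dtor != NULL)
            c->dtor(obj, c->size);
        free(obj);
    }
}

/* Usage example: a constructor/destructor pair that counts invocations. */
static int n_ctor, n_dtor;
static void count_ctor(void *buf, size_t size) { (void)buf; (void)size; n_ctor++; }
static void count_dtor(void *buf, size_t size) { (void)buf; (void)size; n_dtor++; }
```

Allocating, freeing, and reallocating one object through such a cache runs the constructor once; the destructor runs only when cache_reap() drains the cache.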


2.1. An Example 


Consider the following data structure: 


struct foo {
        kmutex_t        foo_lock;
        kcondvar_t      foo_cv;
        struct bar      *foo_barlist;
        int             foo_refcnt;
};


Assume that a foo structure cannot be freed until
there are no outstanding references to it
(foo_refcnt == 0) and all of its pending bar
events (whatever they are) have completed
(foo_barlist == NULL). The life cycle of a
dynamically allocated foo would be something like
this:


foo = kmem_alloc(sizeof (struct foo), KM_SLEEP);
mutex_init(&foo->foo_lock, ...);
cv_init(&foo->foo_cv, ...);
foo->foo_refcnt = 0;
foo->foo_barlist = NULL;

use foo;

ASSERT(foo->foo_barlist == NULL);
ASSERT(foo->foo_refcnt == 0);
cv_destroy(&foo->foo_cv);
mutex_destroy(&foo->foo_lock);
kmem_free(foo);


Notice that between each use of a foo object we 
perform a sequence of operations that constitutes 
nothing more than a very expensive no-op. All of 
this overhead (i.e., everything other than ‘‘use foo’’ 
above) can be eliminated by object caching. 


2.2. The Case for Object Caching in the 
Central Allocator 


Of course, object caching can be implemented 
without any help from the central allocator — any 
subsystem can have a private implementation of the 
algorithm described above. However, there are 
several disadvantages to this approach: 


(1) There is a natural tension between an object 
cache, which wants to keep memory, and the 
rest of the system, which wants that memory 
back. Privately-managed caches cannot handle 
this tension sensibly. They have limited 
insight into the system’s overall memory needs 
and no insight into each other’s needs. Simi- 
larly, the rest of the system has no knowledge 
of the existence of these caches and hence has 
no way to ‘‘pull’’ memory from them.


(2) Since private caches bypass the central alloca- 
tor, they also bypass any accounting mechan- 
isms and debugging features that allocator may 
possess. This makes the operating system 
more difficult to monitor and debug. 


(3) Having many instances of the same solution to 
a common problem increases kernel code size 
and maintenance costs. 


Object caching requires a greater degree of coopera-
tion between the allocator and its clients than the
standard kmem_alloc(9F)/kmem_free(9F)
interface allows. The next section develops an
interface to support constructed object caching in
the central allocator.


2.3. Object Cache Interface


The interface presented here follows from two 
observations: 


(A) Descriptions of objects (name, size, alignment, 
constructor, and destructor) belong in the 
clients — not in the central allocator. The 
allocator should not just ‘‘know’’ that
sizeof (struct inode) is a useful pool
size, for example. Such assumptions are brittle 
[Grunwald93A] and cannot anticipate the needs 
of third-party device drivers, streams modules 
and file systems. 


(B) Memory management policies belong in the 
central allocator — not in its clients. The 
clients just want to allocate and free objects 
quickly. They shouldn’t have to worry about 
how to manage the underlying memory
efficiently.


It follows from (A) that object cache creation must 
be client-driven and must include a full specification 
of the objects: 


(1) struct kmem_cache *kmem_cache_create(
        char *name,
        size_t size,
        int align,
        void (*constructor)(void *, size_t),
        void (*destructor)(void *, size_t));


Creates a cache of objects, each of size size, 
aligned on an align boundary. The align- 
ment will always be rounded up to the 
minimum allowable value, so align can be 
zero whenever no special alignment is required. 
name identifies the cache for statistics and 
debugging. constructor is a function that 
constructs (that is, performs the one-time ini- 
tialization of) objects in the cache; destruc- 
tor undoes this, if applicable. The construc- 
tor and destructor take a size argument so 
that they can support families of similar 
caches, e.g. streams messages. 
kmem_cache_create returns an opaque
descriptor for accessing the cache. 


Next, it follows from (B) that clients should need
just two simple functions to allocate and free
objects:


(2) void *kmem_cache_alloc(
        struct kmem_cache *cp,
        int flags);


Gets an object from the cache. The object will
be in its constructed state. flags is either
KM_SLEEP or KM_NOSLEEP, indicating
whether it’s acceptable to wait for memory if
none is currently available.


(3) void kmem_cache_free(
        struct kmem_cache *cp,
        void *buf);


Returns an object to the cache. The object 
must still be in its constructed state. 


Finally, if a cache is no longer needed the client can 
destroy it: 


(4) void kmem_cache_destroy(
        struct kmem_cache *cp);


Destroys the cache and reclaims all associated 
resources. All allocated objects must have 
been returned to the cache. 


This interface allows us to build a flexible allocator 
that is ideally suited to the needs of its clients. In 
this sense it is a ‘‘custom’’ allocator. However, it 
does not have to be built with compile-time 
knowledge of its clients as most custom allocators 
do [Bozman84A, Grunwald93A, Margolin71], nor 
does it have to keep guessing as in the adaptive-fit 
methods [Bozman84B, Leverett82, Oldehoeft85]. 
Rather, the object-cache interface allows clients to 
specify the allocation services they need on the fly. 


2.4. An Example 


This example demonstrates the use of object cach- 
ing for the ‘‘foo’’ objects introduced in Section 2.1. 
The constructor and destructor routines are: 


void
foo_constructor(void *buf, size_t size)
{
        struct foo *foo = buf;

        mutex_init(&foo->foo_lock, ...);
        cv_init(&foo->foo_cv, ...);
        foo->foo_refcnt = 0;
        foo->foo_barlist = NULL;
}

void
foo_destructor(void *buf, size_t size)
{
        struct foo *foo = buf;

        ASSERT(foo->foo_barlist == NULL);
        ASSERT(foo->foo_refcnt == 0);
        cv_destroy(&foo->foo_cv);
        mutex_destroy(&foo->foo_lock);
}


To create the foo cache:


foo_cache = kmem_cache_create("foo_cache",
        sizeof (struct foo), 0,
        foo_constructor, foo_destructor);


To allocate, use, and free a foo object: 


foo = kmem_cache_alloc(foo_cache, KM_SLEEP);

use foo;

kmem_cache_free(foo_cache, foo);


This makes foo allocation fast, because the
allocator will usually do nothing more than fetch an
already-constructed foo from the cache.
foo_constructor and foo_destructor
will be invoked only to populate and drain the
cache, respectively.


The example above illustrates a beneficial
side-effect of object caching: it reduces the
instruction-cache footprint of the code that uses
cached objects by moving the rarely-executed con-
struction and destruction code out of the hot path.


3. Slab Allocator Implementation 


This section describes the implementation of the 
SunOS 5.4 kernel memory allocator, or ‘‘slab allo- 
cator,’’ in detail. (The name derives from one of 
the allocator’s main data structures, the slab. The 
name stuck within Sun because it was more distinc- 
tive than ‘‘object’’ or ‘‘cache.’’ Slabs will be dis- 
cussed in Section 3.2.) 


The terms object, buffer, and chunk will be 
used more or less interchangeably, depending on 
how we’re viewing that piece of memory at the 
moment. 


3.1. Caches 


Each cache has a front end and back end which are 
designed to be as decoupled as possible: 


[Figure: a cache's back end (kmem_cache_grow, kmem_cache_reap) and front end (kmem_cache_alloc).]


The front end is the public interface to the 
allocator. It moves objects to and from the cache, 
calling into the back end when it needs more 
objects. 


The back end manages the flow of real
memory through the cache. The influx routine
(kmem_cache_grow()) gets memory from the
VM system, makes objects out of it, and feeds those 
objects into the cache. The outflux routine 
(kmem_cache_reap()) is invoked by the VM 
system when it wants some of that memory back — 
e.g., at the onset of paging. Note that all back-end 
activity is triggered solely by memory pressure. 
Memory flows in when the cache needs more 
objects and flows back out when the rest of the sys- 
tem needs more pages; there are no arbitrary limits 
or watermarks. Hysteresis control is provided by a 
working-set algorithm, described in Section 3.4. 


The slab allocator is not a monolithic entity, 
but rather is a loose confederation of independent 
caches. The caches have no shared state, so the 
allocator can employ per-cache locking instead of 
protecting the entire arena (kernel heap) with one 
global lock. Per-cache locking improves scalability 
by allowing any number of distinct caches to be 
accessed simultaneously. 


Each cache maintains its own statistics — 
total allocations, number of allocated and free 
buffers, etc. These per-cache statistics provide 
insight into overall system behavior. They indicate 
which parts of the system consume the most 
memory and help to identify memory leaks. They 
also indicate the activity level in various subsys- 
tems, to the extent that allocator traffic is an accu- 
rate approximation. (Streams message allocation is 
a direct measure of streams activity, for example.) 


The slab allocator is operationally similar to 
the ‘‘CustoMalloc’’ [Grunwald93A], ‘‘QuickFit’’ 
[Weinstock88], and ‘‘Zone’’ [VanSciver88] alloca- 
tors, all of which maintain distinct freelists of the 
most commonly requested buffer sizes. The 
Grunwald and Weinstock papers each demonstrate 
that a customized segregated-storage allocator — 
one that has a priori knowledge of the most com- 
mon allocation sizes — is usually optimal in both 
space and time. The slab allocator is in this 
category, but has the advantage that its customiza- 
tions are client-driven at run time rather than being 
hard-coded at compile time. (This is also true of 
the Zone allocator.) 


The standard non-caching allocation routines, 
kmem_alloc(9F) and kmem_free(9F), use
object caches internally. At startup, the system
creates a set of about 30 caches ranging in size
from 8 bytes to 9K in roughly 10-20% increments.
kmem_alloc() simply performs a
kmem_cache_alloc() from the nearest-size
cache. Allocations larger than 9K, which are rare, 
are handled directly by the back-end page supplier. 
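
The mapping from a request size to a cache can be sketched as a table lookup. The size table below is purely illustrative: the text gives only the range (8 bytes to 9K) and the approximate spacing, not the actual values, and this sketch interprets "nearest-size" as the smallest class that is at least as large as the request. The function name size_class() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative size classes only; the real table is not given in the text. */
static const size_t class_sizes[] = {
    8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128,
    160, 192, 224, 256, 320, 384, 448, 512, 640, 768, 896, 1024,
    1280, 1536, 2048, 3072, 4096, 6144, 9216
};
#define NCLASSES (sizeof class_sizes / sizeof class_sizes[0])

/*
 * Returns the index of the smallest class that can hold `size` bytes,
 * or -1 if the request exceeds the largest class (such a request would
 * be passed to the back-end page supplier instead).
 */
int size_class(size_t size)
{
    size_t i;

    assert(size > 0);
    for (i = 0; i < NCLASSES; i++)
        if (size <= class_sizes[i])
            return (int)i;
    return -1;
}
```

With this table a 100-byte request lands in the 112-byte class, wasting 12 bytes, which is the kind of bounded per-buffer overhead that closely spaced classes are meant to achieve.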


3.2. Slabs


The slab is the primary unit of currency in the slab 
allocator. When the allocator needs to grow a 
cache, for example, it acquires an entire slab of 
objects at once. Similarly, the allocator reclaims 
unused memory (shrinks a cache) by relinquishing a 
complete slab. 


A slab consists of one or more pages of virtu- 
ally contiguous memory carved up into equal-size 
chunks, with a reference count indicating how many 
of those chunks have been allocated. The benefits 
of using this simple data structure to manage the 
arena are somewhat striking: 


(1) Reclaiming unused memory is trivial. When 
the slab reference count goes to zero the associated 
pages can be returned to the VM system. Thus a 
simple reference count replaces the complex trees, 
bitmaps, and coalescing algorithms found in most 
other allocators [Knuth68, Korn85, Standish80]. 


(2) Allocating and freeing memory are fast, 
constant-time operations. All we have to do is 
move an object to or from a freelist and update a 
reference count. 


(3) Severe external fragmentation (unused 
buffers on the freelist) is unlikely. Over time, 
many allocators develop an accumulation of small, 
unusable buffers. This occurs as the allocator splits 
existing free buffers to satisfy smaller requests. For 
example, the right sequence of 32-byte and 40-byte 
allocations may result in a large accumulation of 
free 8-byte buffers — even though no 8-byte buffers 
are ever requested [Standish80]. A segregated- 
storage allocator cannot suffer this fate, since the 
only way to populate its 8-byte freelist is to actually 
allocate and free 8-byte buffers. Any sequence of 
32-byte and 40-byte allocations — no matter how 
complex — can only result in population of the 32- 
byte and 40-byte freelists. Since prior allocation is 
a good predictor of future allocation [Weinstock88] 
these buffers are likely to be used again. 


The other reason that slabs reduce external fragmen- 
tation is that all objects in a slab are of the same 
type, so they have the same lifetime distribution.* 
The resulting segregation of short-lived and long- 
lived objects at slab granularity reduces the likeli- 
hood of an entire page being held hostage due to a 
single long-lived allocation [Barrett93, Hanson90]. 


* The generic caches that back kmem_alloc() are a
notable exception, but they constitute a relatively small fraction
of the arena in SunOS 5.4; all of the major consumers of
memory now use kmem_cache_alloc().


(4) Internal fragmentation (per-buffer wasted
space) is minimal. Each buffer is exactly the right 
size (namely, the cache’s object size), so the only 
wasted space is the unused portion at the end of the 
slab. For example, assuming 4096-byte pages, the 
slabs in a 400-byte object cache would each contain 
10 buffers, with 96 bytes left over. We can view 
this as equivalent to 9.6 bytes of wasted space per
400-byte buffer, or 2.4% internal fragmentation. 


In general, if a slab contains n buffers, then the 
internal fragmentation is at most 1/n; thus the allo- 
cator can actually control the amount of internal 
fragmentation by controlling the slab size. How- 
ever, larger slabs are more likely to cause external 
fragmentation, since the probability of being able to 
reclaim a slab decreases as the number of buffers 
per slab increases. The SunOS 5.4 implementation 
limits internal fragmentation to 12.5% (1/8), since 
this was found to be the empirical sweet-spot in the 
trade-off between internal and external fragmenta- 
tion. 
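
The arithmetic behind the 1/n bound and the 12.5% limit can be checked with a small sketch. The names slab_fit() and pages_for_waste_limit() are hypothetical, and the sketch ignores the space taken by the slab's own bookkeeping data. It measures internal fragmentation as leftover bytes divided by bytes occupied by buffers, which reproduces the 2.4% figure in the example above (96 / 4000). The rule of picking the smallest slab that meets the limit is an assumption; the text states only the limit itself.

```c
#include <assert.h>
#include <stddef.h>

struct slab_fit {
    size_t nbufs;     /* buffers that fit in the slab */
    size_t leftover;  /* unused bytes at the end of the slab */
};

/* How many obj_size buffers fit in slab_bytes, and what is left over. */
struct slab_fit slab_fit(size_t slab_bytes, size_t obj_size)
{
    struct slab_fit f;

    assert(obj_size > 0 && obj_size <= slab_bytes);
    f.nbufs = slab_bytes / obj_size;
    f.leftover = slab_bytes % obj_size;
    return f;
}

/*
 * Smallest slab size, in pages, for which the leftover space is at most
 * 1/8 of the space used by buffers. Returns 0 if no slab of up to
 * max_pages pages qualifies.
 */
size_t pages_for_waste_limit(size_t page_size, size_t obj_size,
                             size_t max_pages)
{
    size_t pages;

    for (pages = 1; pages <= max_pages; pages++) {
        size_t bytes = pages * page_size;
        struct slab_fit f;

        if (obj_size > bytes)
            continue;                 /* object does not fit yet */
        f = slab_fit(bytes, obj_size);
        if (f.leftover * 8 <= f.nbufs * obj_size)
            return pages;
    }
    return 0;
}
```

For 400-byte objects on 4096-byte pages a one-page slab already qualifies (10 buffers, 96 bytes left over). For 3000-byte objects, one-page and two-page slabs waste too much, and three pages (4 buffers, 288 bytes left over) are needed.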


3.2.1. Slab Layout — Logical 


The contents of each slab are managed by a 
kmem_slab data structure that maintains the slab’s
linkage in the cache, its reference count, and its list
of free buffers. In turn, each buffer in the slab is
managed by a kmem_bufctl structure that holds
the freelist linkage, buffer address, and a back- 
pointer to the controlling slab. Pictorially, a slab 
looks like this (bufctl-to-slab back-pointers not 
shown): 


[Figure: logical slab layout, showing the kmem_slab with its link to the next slab in the cache, its kmem_bufctl structures, and the buffers in the slab's one or more pages.]


3.2.2. Slab Layout for Small Objects


For objects smaller than 1/8 of a page, a slab is 
built by allocating a page, placing the slab data at 
the end, and dividing the rest into equal-size 
buffers: 


[Figure: small-object slab layout, showing one page divided into equal-size buffers with the slab data at the end.]


Each buffer serves as its own bufctl while on the 
freelist. Only the linkage is actually needed, since 
everything else is computable. These are essential 
optimizations for small buffers — otherwise we 
would end up allocating almost as much memory 
for bufctls as for the buffers themselves. 


The freelist linkage resides at the end of the 
buffer, rather than the beginning, to facilitate debug- 
ging. This is driven by the empirical observation 
that the beginning of a data structure is typically 
more active than the end. If a buffer is modified 
after being freed, the problem is easier to diagnose 
if the heap structure (freelist linkage) is still intact. 


The allocator reserves an additional word for 
constructed objects so that the linkage doesn’t 
overwrite any constructed state. 
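
The small-object layout can be illustrated with a user-space sketch. All names here (small_slab, slab_of(), link_of(), and so on) are hypothetical, the 4096-byte page size is an assumption, and aligned_alloc() stands in for the kernel's page supplier. The sketch stores the freelist link in the last word of each free buffer and recovers the slab data from a buffer address by rounding down to the page boundary. It omits the extra word that the allocator reserves for constructed objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u   /* assumed page size */

struct small_slab {       /* slab data, kept at the end of the page */
    void  *freelist;      /* first free buffer, or NULL */
    size_t refcnt;        /* number of allocated buffers */
    size_t bufsize;       /* size of each buffer in bytes */
};

/* The slab data address is computable from any buffer address in the page. */
struct small_slab *slab_of(void *buf)
{
    uintptr_t page = (uintptr_t)buf & ~(uintptr_t)(PAGE_SIZE - 1);
    return (struct small_slab *)(page + PAGE_SIZE - sizeof(struct small_slab));
}

/* The freelist linkage lives in the last word of a free buffer. */
void **link_of(void *buf, size_t bufsize)
{
    return (void **)((char *)buf + bufsize - sizeof(void *));
}

struct small_slab *slab_create(size_t bufsize)
{
    char *page;
    struct small_slab *sp;
    size_t i, n;

    assert(bufsize >= sizeof(void *) && bufsize % sizeof(void *) == 0);
    page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
    if (page == NULL)
        return NULL;
    sp = slab_of(page);
    sp->freelist = NULL;
    sp->refcnt = 0;
    sp->bufsize = bufsize;

    /* Carve the rest of the page into buffers; push in reverse order
     * so that the buffer at the start of the page is allocated first. */
    n = (PAGE_SIZE - sizeof *sp) / bufsize;
    for (i = n; i-- > 0; ) {
        void *buf = page + i * bufsize;
        *link_of(buf, bufsize) = sp->freelist;
        sp->freelist = buf;
    }
    return sp;
}

void *slab_alloc(struct small_slab *sp)
{
    void *buf = sp->freelist;

    if (buf != NULL) {
        sp->freelist = *link_of(buf, sp->bufsize);
        sp->refcnt++;
    }
    return buf;
}

void slab_free(void *buf)
{
    struct small_slab *sp = slab_of(buf);   /* no lookup table needed */

    assert(sp->refcnt > 0);
    *link_of(buf, sp->bufsize) = sp->freelist;
    sp->freelist = buf;
    sp->refcnt--;
}
```

Note that slab_free() needs only the buffer address. That is the property lost with multi-page slabs, which is why large objects require separate bufctl structures and a hash table.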


3.2.3. Slab Layout for Large Objects 


The above scheme is efficient for small objects, but 
not for large ones. It could fit only one 2K buffer 
on a 4K page because of the embedded slab data. 
Moreover, with large (multi-page) slabs we lose the 
ability to determine the slab data address from the 
buffer address. Therefore, for large objects the 
physical layout is identical to the logical layout. 
The required slab and bufctl data structures come 
from their own (small-object!) caches. A per-cache 
self-scaling hash table provides buffer-to-bufctl 
conversion. 


3.3. Freelist Management 


Each cache maintains a circular, doubly-linked list 
of all its slabs. The slab list is partially sorted, in 
that the empty slabs (all buffers allocated) come 
first, followed by the partial slabs (some buffers 
allocated, some free), and finally the complete slabs 
(all buffers free, refcnt == 0). The cache’s freelist
pointer points to its first non-empty slab. Each slab,
in turn, has its own freelist of available buffers.
This two-level freelist structure simplifies memory
reclaiming. When the allocator reclaims a slab it
doesn’t have to unlink each buffer from the cache’s
freelist; it just unlinks the slab.


3.4. Reclaiming Memory 


When kmem_cache free() sees that the slab 
reference count is zero, it does not immediately 
reclaim the memory. Instead, it just moves the slab 
to the tail of the freelist where all the complete 
slabs reside. This ensures that no complete slab
will be broken up unless all partial slabs have been 
depleted. 


When the system runs low on memory it asks 
the allocator to liberate as much memory as it can. 
The allocator obliges, but retains a 15-second work- 
ing set of recently-used slabs to prevent thrashing. 
Measurements indicate that system performance is 
fairly insensitive to the slab working-set interval. 
Presumably this is because the two extremes — 
zero working set (reclaim all complete slabs on 
demand) and infinite working-set (never reclaim 
anything) — are both reasonable, albeit suboptimal, 
policies. 


4. Hardware Cache Effects 


Modern hardware relies on good cache utilization, 
so it is important to design software with cache
effects in mind. For a memory allocator there are 
two broad classes of cache effects to consider: the 
distribution of buffer addresses and the cache foot- 
print of the allocator itself. The latter topic has 
received some attention [Chen93, Grunwald93B], 
but the effect of buffer address distribution on cache 
utilization and bus balance has gone largely
unrecognized.


4.1. Impact of Buffer Address Distribution 
on Cache Utilization 


The address distribution of mid-size buffers can 
affect the system’s overall cache utilization. In particular,
power-of-two allocators (where all buffers are 2^n bytes and
are 2^n-byte aligned) are pessimal.* Suppose, for example,
that every inode (≈300 bytes) is assigned a 512-byte buffer, 512-byte
aligned, and that only the first dozen fields of an 
inode (48 bytes) are frequently referenced. Then 
the majority of inode-related memory traffic will be 


* Such allocators are common because they are easy to 
implement. For example, 4.4BSD and SVr4 both employ 
power-of-two methods [McKusick88, Lee89]. 


1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA 


at addresses between 0 and 47 modulo 512. Thus 
the cache lines near 512-byte boundaries will be 
heavily loaded while the rest lie fallow. In effect 
only 9% (48/512) of the cache will be usable by 
inodes. Fully-associative caches would not suffer 
this problem, but current hardware trends are toward 
simpler rather than more complex caches. 
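
The 48/512 figure can be checked with a small simulation. The sketch below assumes a 64 KB direct-mapped cache tracked at byte granularity (both are assumptions for illustration, as is the name `hot_cache_fraction`); it places buffers a fixed stride apart, touches only the hot prefix of each, and reports what fraction of the cache ever sees a hot reference.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_BYTES 65536u  /* assumed size of a direct-mapped cache */

/*
 * Place `nbufs` buffers `stride` bytes apart, touch only the first
 * `hot_bytes` of each, and return the fraction of cache bytes that
 * ever receive a hot reference.
 */
static double hot_cache_fraction(size_t nbufs, size_t stride, size_t hot_bytes)
{
    static unsigned char touched[CACHE_BYTES];
    size_t i, j, count = 0;

    assert(hot_bytes <= stride);
    memset(touched, 0, sizeof touched);
    for (i = 0; i < nbufs; i++)
        for (j = 0; j < hot_bytes; j++)
            touched[(i * stride + j) % CACHE_BYTES] = 1;
    for (i = 0; i < CACHE_BYTES; i++)
        count += touched[i];
    return (double)count / CACHE_BYTES;
}
```

With a 512-byte stride the result stays at 48/512 no matter how many buffers are placed, because every hot region maps to the same residues modulo 512. A stride that is not a power of two (304 bytes, say) lets the hot regions of successive buffers land on different parts of the cache.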


Of course, there’s nothing special about 
inodes. The kernel contains many other mid-size 
data structures (e.g. 100-500 bytes) with the same 
essential qualities: there are many of them, they 
contain only a few heavily used fields, and those 
fields are grouped together at or near the beginning 
of the structure. This artifact of the way data struc- 
tures evolve has not previously been recognized as 
an important factor in allocator design. 


4.2. Impact of Buffer Address Distribution 
on Bus Balance 


On a machine that interleaves memory across multi- 
ple main buses, the effects described above also 
have a significant impact on bus utilization. The 
SPARCcenter 2000, for example, employs 256-byte 
interleaving across two main buses [Cekleov92]. 
Continuing the example above, we see that any 
power-of-two allocator maps the first half of every 
inode (the hot part) to bus 0 and the second half to 
bus 1. Thus almost all inode-related cache misses 
are serviced by bus 0. The situation is exacerbated 
by an inflated miss rate, since all of the inodes are 
fighting over a small fraction of the cache. 


These effects can be dramatic. On a 
SPARCcenter 2000 running LADDIS under a 
SunOS 5.4 development kernel, replacing the old 
allocator (a power-of-two buddy-system [Lee89]) 
with the slab allocator reduced bus imbalance from 
43% to just 17%. In addition, the primary cache 
miss rate dropped by 13%. 
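
The bus mapping in the inode example can be reproduced with a few lines of C. This is only a sketch of the address arithmetic: the 256-byte interleave and the two buses come from the text, while the function name `bus_traffic` and the byte-by-byte counting are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define INTERLEAVE 256u  /* bytes mapped to one bus before switching */
#define NBUSES     2u

/* Count hot-byte references serviced by each bus. */
static void bus_traffic(size_t nbufs, size_t stride, size_t hot_bytes,
                        size_t counts[NBUSES])
{
    size_t i, j;

    assert(hot_bytes <= stride);
    counts[0] = counts[1] = 0;
    for (i = 0; i < nbufs; i++)
        for (j = 0; j < hot_bytes; j++)
            counts[((i * stride + j) / INTERLEAVE) % NBUSES]++;
}
```

For 512-byte buffers with a 48-byte hot region, every reference is counted against bus 0. Any placement that varies the buffer offset relative to the 512-byte boundary moves some of that traffic to bus 1.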


4.3. Slab Coloring


The slab allocator incorporates a simple coloring 
scheme that distributes buffers evenly throughout 
the cache, resulting in excellent cache utilization 
and bus balance. The concept is simple: each time 
a new slab is created, the buffer addresses start at a
slightly different offset (color) from the slab base 
(which is always page-aligned). For example, for a 
cache of 200-byte objects with 8-byte alignment, the 
first slab’s buffers would be at addresses 0, 200, 
400, ... relative to the slab base. The next slab’s 
buffers would be at offsets 8, 208, 408, ... and so 
on. The maximum slab color is determined by the
amount of unused space in the slab. In this exam- 
ple, assuming 4K pages, we can fit 20 200-byte 
buffers in a 4096-byte slab. The buffers consume 
4000 bytes, the kmem_slab data consumes 32
bytes, and the remaining 64 bytes are available for 
coloring. Thus the maximum slab color is 64, and 
the slab color sequence is 0, 8, 16, 24, 32, 40, 48, 
56, 64, 0, 8, ... 


One particularly nice property of this coloring 
scheme is that mid-size power-of-two buffers 
receive the maximum amount of coloring, since 
they are the worst-fitting. For example, while 128 
bytes goes perfectly into 4096, it goes near- 
pessimally into 4096 - 32, which is what’s actually 
available (because of the embedded slab data). 
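
The color arithmetic in this section is easy to reproduce. The following sketch uses hypothetical helper names (`slab_max_color`, `slab_next_color`); the numbers in the usage note are the ones given in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Space left over in a slab after packing as many buffers as will fit. */
static size_t slab_max_color(size_t slab_size, size_t slab_data, size_t bufsize)
{
    assert(bufsize > 0 && slab_data < slab_size);
    return (slab_size - slab_data) % bufsize;
}

/* Color for the next slab: advance by the alignment, wrap past the maximum. */
static size_t slab_next_color(size_t color, size_t align, size_t max_color)
{
    color += align;
    return color > max_color ? 0 : color;
}
```

For 200-byte objects in a 4096-byte slab with 32 bytes of slab data, `slab_max_color` gives 64, and stepping by the 8-byte alignment yields 0, 8, ..., 64, 0, ... For 128-byte buffers the same slab leaves 96 bytes for coloring, whereas 128 divides 4096 exactly and would leave none.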


4.4. Arena Management 


An allocator’s arena management strategy deter- 
mines its dynamic cache footprint. These strategies 
fall into three broad categories: sequential-fit 
methods, buddy methods, and segregated-storage 
methods [Standish80]. 


A sequential-fit allocator must typically search 
several nodes to find a good-fitting buffer. Such 
methods are, by nature, condemned to a large cache 
footprint: they have to examine a significant number 
of nodes that are generally nowhere near each other. 
This causes not only cache misses, but TLB misses 
as well. The coalescing stages of buddy-system 
allocators [Knuth68, Lee89] have similar properties. 


A segregated-storage allocator, such as the 
slab allocator, maintains separate freelists for dif- 
ferent buffer sizes. These allocators generally have 
good cache locality because allocating a buffer is so 
simple. All the allocator has to do is determine the 
right freelist (by computation, by table lookup, or 
by having it supplied as an argument) and take a 
buffer from it. Freeing a buffer is similarly 
straightforward. There are only a handful of 
pointers to load, so the cache footprint is small. 
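
A generic segregated-storage fast path might look like the sketch below. It is not the slab allocator's code: the power-of-two size classes starting at 16 bytes and all of the names are assumptions chosen to show how few pointers are touched on allocation and free.

```c
#include <assert.h>
#include <stddef.h>

struct freebuf  { struct freebuf *next; };   /* link stored inside a free buffer */
struct freelist { struct freebuf *head; };

/* Pick a freelist by computation: class i holds buffers of (16 << i) bytes. */
static size_t size_class(size_t size)
{
    size_t i = 0;

    assert(size > 0);
    while (((size_t)16 << i) < size)
        i++;
    return i;
}

/* Allocation pops the first buffer; NULL means the list is empty. */
static void *freelist_alloc(struct freelist *fl)
{
    struct freebuf *b = fl->head;

    if (b != NULL)
        fl->head = b->next;
    return b;
}

/* Freeing pushes the buffer back onto its list. */
static void freelist_free(struct freelist *fl, void *buf)
{
    struct freebuf *b = buf;

    b->next = fl->head;
    fl->head = b;
}
```

When the freelist is supplied as an argument, as with a cache descriptor, the `size_class` step disappears entirely.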


The slab allocator has the additional advan- 
tage that for small to mid-size buffers, most of the 
relevant information — the slab data, bufctls, and 
buffers themselves — resides on a single page. 
Thus a single TLB entry covers most of the action. 


5. Performance


This section compares the performance of the slab
allocator to three other well-known kernel memory
allocators:

• SunOS 4.1.3, based on [Stephenson83], a
  sequential-fit method;

• 4.4BSD, based on [McKusick88], a power-of-two
  segregated-storage method;

• SVr4, based on [Lee89], a power-of-two
  buddy-system method. This allocator was
  employed in all previous SunOS 5.x releases.

To get a fair comparison, each of these allocators
was ported into the same SunOS 5.4 base system.
This ensures that we are comparing just allocators,
not entire operating systems.


5.1. Speed Comparison 


On a SPARCstation-2 the time required to allocate 
and free a buffer under the various allocators is as 
follows: 


Memory Allocation + Free Costs

allocator      interface            time (μsec)
slab           kmem_cache_alloc
4.4BSD         kmem_alloc
slab           kmem_alloc
SVr4           kmem_alloc
SunOS 4.1.3    kmem_alloc


Note: The 4.4BSD allocator offers both functional 
and preprocessor macro interfaces. These measure- 
ments are for the functional version. Non-binary 
interfaces in general were not considered, since 
these cannot be exported to drivers without expos- 
ing the implementation. The 4.4BSD allocator was 
compiled without KMEMSTATS defined (it’s on by 
default) to get the fastest possible code. 


A mutex_enter()/mutex_exit() pair
costs 1.0 μsec, so the locking required to allocate
and free a buffer imposes a lower bound of 2.0
μsec. The slab and 4.4BSD allocators are both very
close to this limit because they do very little work
in the common cases. The 4.4BSD implementation
of kmem_alloc() is slightly faster, since it has
less accounting to do (it never reclaims memory).
The slab allocator’s kmem_cache_alloc()
interface is even faster, however, because it doesn’t
have to determine which freelist (cache) to use:
the cache descriptor is passed as an argument to
kmem_cache_alloc(). In any event, the differences
in speed between the slab and 4.4BSD
allocators are small. This is to be expected, since
all segregated-storage methods are operationally
similar. Any good segregated-storage implementation
should achieve excellent performance.


The SVr4 allocator is slower than most buddy 
systems but still provides reasonable, predictable 
speed. The SunOS 4.1.3 allocator, like most 
sequential-fit methods, is comparatively slow and 
quite variable. 


The benefits of object caching are not visible 
in the numbers above, since they only measure the 
cost of the allocator itself. The table below shows 
the effect of object caching on some of the most 
frequent allocations in the SunOS 5.4 kernel
(SPARCstation-2 timings, in microseconds): 


Effect of Object Caching

allocation       without     with        improvement
type             caching     caching
allocb
dupb
shalloc
allocq
anonmap_alloc
makepipe


All of the numbers presented in this section 
measure the performance of the allocator in isola- 
tion. The allocator’s effect on overall system per- 
formance will be discussed in Section 5.3. 


5.2. Memory Utilization Comparison 


An allocator generally consumes more memory than 
its clients actually request due to imperfect fits 
(internal fragmentation), unused buffers on the free- 
list (external fragmentation), and the overhead of 
the allocator’s internal data structures. The ratio of 
memory requested to memory consumed is the 
allocator’s memory utilization. The complementary
ratio is the memory wastage or total fragmentation. 
Good memory utilization is essential, since the ker- 
nel heap consumes physical memory. 
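
The two ratios defined above translate directly into code. The function names here are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of memory requested by clients to memory consumed by the allocator. */
static double memory_utilization(size_t requested, size_t consumed)
{
    assert(consumed > 0 && requested <= consumed);
    return (double)requested / (double)consumed;
}

/* The complementary ratio: total fragmentation (memory wastage). */
static double total_fragmentation(size_t requested, size_t consumed)
{
    return 1.0 - memory_utilization(requested, consumed);
}
```

For example, a heap that consumes 4096 bytes to satisfy 3072 bytes of requests has a utilization of 0.75 and a total fragmentation of 0.25.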


An allocator’s space efficiency is harder to 
characterize than its speed because it is workload- 
dependent. The best we can do is to measure the 
various allocators’ memory utilization under a fixed 
set of workloads. To this end, each allocator was 
subjected to the following workload sequence: 


(1) System boot. This measures the system’s 
memory utilization at the console login prompt 
after rebooting. 
(2) A brief spike in load, generated by the
following trivial program:

    fork(); fork(); fork(); fork();
    fork(); fork(); fork(); fork();
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    sleep(60);
    close(fd);

This creates 256 processes, each of which
creates a socket. This causes a temporary
surge in demand for a variety of kernel data
structures.


(3) Find. This is another trivial spike-generator:

    find /usr -mount -exec file {} \;


(4) Kenbus. This is a standard timesharing benchmark.
Kenbus generates a large amount of
concurrent activity, creating large demand for
both user and kernel memory.


Memory utilization was measured after each step.
The table below summarizes the results for a 16MB
SPARCstation-1. The slab allocator significantly
outperformed the others, ending up with half the
fragmentation of the nearest competitor (results are
cumulative, so the “kenbus” column indicates the
fragmentation after all four steps were completed):


allocator      boot    spike    find    kenbus    s/m
slab
SunOS 4.1.3
4.4BSD
SVr4


The last column shows the kenbus results,
which measure peak throughput in units of scripts
executed per minute (s/m). Kenbus performance is
primarily memory-limited on this 16MB system,
which is why the SunOS 4.1.3 allocator achieved
better results than the 4.4BSD allocator despite
being significantly slower. The slab allocator
delivered the best performance by an 11% margin
because it is both fast and space-efficient.


To get a handle on real-life performance the
author used each of these allocators for a week on
his personal desktop machine, a 32MB
SPARCstation-2. This machine is primarily used
for reading e-mail, running simple commands and
scripts, and connecting to test machines and compute
servers. The results of this obviously
non-controlled experiment were:


Effect of One Week of Light Desktop Use

allocator      kernel heap    fragmentation
slab
SunOS 4.1.3
SVr4
4.4BSD


These numbers are consistent with the results 
from the synthetic workload described above. In 
both cases, the slab allocator generates about half 
the fragmentation of SunOS 4.1.3, which in turn 
generates about half the fragmentation of SVr4 and 
4.4BSD. 


5.3. Overall System Performance 


The kernel memory allocator affects overall system 
performance in a variety of ways. In previous sec- 
tions we considered the effects of several individual 
factors: object caching, hardware cache and bus 
effects, speed, and memory utilization. We now 
turn to the most important metric: the bottom-line 
performance of interesting workloads. In SunOS 
5.4 the SVr4-based allocator was replaced by the 
slab allocator described here. The table below 
shows the net performance improvement in several 
key areas. 


System Performance Improvement
with Slab Allocator

workload       gain    what it measures
                       parallel compilation
                       many-user typing

Notes: 


(1) DeskBench and kenbus are both memory- 
bound in 16MB, so most of the improvement 
here is due to the slab allocator’s space 
efficiency. 


(2) The TPC-B workload causes very little kernel 
memory allocation, so the allocator’s speed is 
not a significant factor here. The test was run 
on a large server with enough memory that it 
never paged (under either allocator), so space 
efficiency is not a factor either. The 4% per- 
formance improvement is due solely to better 
cache utilization (5% fewer primary cache 
misses, 2% fewer secondary cache misses). 




(3) Parallel make was run on a large server that 
never paged. This workload generates a lot of 
allocator traffic, so the improvement here is 
attributable to the slab allocator’s speed, object 
caching, and the system’s lower overall cache 
miss rate (5% fewer primary cache misses, 4% 
fewer secondary cache misses). 


(4) Terminal server was also run on a large server 
that never paged. This benchmark spent 25% 
of its time in the kernel with the old allocator, 
versus 20% with the new allocator. Thus, the 
5% bottom-line improvement is due to a 20% 
reduction in kernel time. 


6. Debugging Features 


Programming errors that corrupt the kernel heap — 
such as modifying freed memory, freeing a buffer 
twice, freeing an uninitialized pointer, or writing 
beyond the end of a buffer — are often difficult to 
debug. Fortunately, a thoroughly instrumented ker- 
nel memory allocator can detect many of these 
problems. 


This section describes the debugging features 
of the slab allocator. These features can be enabled 
in any SunOS 5.4 kernel (not just special debugging 
versions) by booting under kadb (the kernel 
debugger) and setting the appropriate flags.* When 
the allocator detects a problem, it provides detailed 
diagnostic information on the system console. 


6.1. Auditing 


In audit mode the allocator records its activity in a 
circular transaction log. It stores this information in 
an extended version of the bufctl structure that 
includes the thread pointer, hi-res timestamp, and 
stack trace of the transaction. When corruption is 
detected by any of the other methods, the previous 
owners of the affected buffer (the likely suspects) 
can be determined. 
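
A circular transaction log of this kind can be sketched as follows. The record layout is hypothetical and much reduced (an integer stands in for the thread pointer and the stack trace is omitted), and a flat array is used here for simplicity, whereas the text says the allocator keeps the information in extended bufctl structures.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_ENTRIES 4   /* tiny on purpose; a real log would be far larger */

/* Hypothetical audit record; the stack trace is omitted for brevity. */
struct txn {
    const void *buf;     /* buffer the transaction applied to */
    int thread;          /* stand-in for the thread pointer */
    long timestamp;      /* stand-in for the hi-res timestamp */
    int is_free;         /* 1 = free, 0 = alloc */
};

struct txlog {
    struct txn rec[LOG_ENTRIES];
    size_t total;        /* transactions ever recorded */
};

/* Record a transaction, overwriting the oldest entry once the log is full. */
static void txlog_record(struct txlog *log, const void *buf, int thread,
                         long timestamp, int is_free)
{
    struct txn *t;

    assert(log != NULL);
    t = &log->rec[log->total % LOG_ENTRIES];
    t->buf = buf;
    t->thread = thread;
    t->timestamp = timestamp;
    t->is_free = is_free;
    log->total++;
}

/* Most recent surviving transaction on `buf`, or NULL if none remains. */
static const struct txn *txlog_last(const struct txlog *log, const void *buf)
{
    size_t n = log->total < LOG_ENTRIES ? log->total : LOG_ENTRIES;
    size_t i;

    for (i = 1; i <= n; i++) {
        const struct txn *t = &log->rec[(log->total - i) % LOG_ENTRIES];

        if (t->buf == buf)
            return t;
    }
    return NULL;
}
```

When corruption is found in a buffer, walking the log backward from the newest entry, as `txlog_last` does, yields the previous owners in order of suspicion.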


6.2. Freed-Address Verification 


The buffer-to-bufctl hash table employed by large-object
caches can be used as a debugging feature: if
the hash lookup in kmem_cache_free() fails,
then the caller must be attempting to free a bogus
address. The allocator can verify all freed addresses
by changing the “large object” threshold to zero.


* The availability of these debugging features adds no cost
to most allocations. The per-cache flag word that indicates
whether a hash table is present (i.e., whether the cache’s
objects are larger than 1/8 of a page) also contains the
debugging flags. A single test checks all of these flags
simultaneously, so the common case (small objects, no
debugging) is unaffected.


6.3. Detecting Use of Freed Memory 


When an object is freed, the allocator applies its 
destructor and fills it with the pattern 0xdeadbeef.
The next time that object is allocated, the allocator
verifies that it still contains the deadbeef pattern. It
then fills the object with 0xbaddcafe and applies its
constructor. The deadbeef and baddcafe patterns are 
chosen to be readily human-recognizable in a 
debugging session. They represent freed memory 
and uninitialized data, respectively. 
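
The fill-and-verify cycle can be illustrated with a short sketch. The two pattern values are the ones named in the text; the function names and the word-at-a-time scan are assumptions, and constructor and destructor calls are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FREE_PATTERN   0xdeadbeefu  /* marks freed memory */
#define UNINIT_PATTERN 0xbaddcafeu  /* marks allocated but uninitialized data */

/* Fill a buffer with a 32-bit pattern (done with FREE_PATTERN at free time). */
static void fill_words(uint32_t *buf, size_t nwords, uint32_t pattern)
{
    size_t i;

    for (i = 0; i < nwords; i++)
        buf[i] = pattern;
}

/*
 * Called at allocation time: returns the byte offset of the first word that
 * no longer holds FREE_PATTERN, or -1 if the buffer is intact. An intact
 * buffer is refilled with UNINIT_PATTERN before the constructor would run.
 */
static long check_and_prepare(uint32_t *buf, size_t nwords)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < nwords; i++)
        if (buf[i] != FREE_PATTERN)
            return (long)(i * sizeof buf[0]);
    fill_words(buf, nwords, UNINIT_PATTERN);
    return -1;
}
```

A stray write to the seventh word of a freed 64-byte object, for instance, is reported as a modification at offset 0x18.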


6.4. Redzone Checking 


Redzone checking detects writes past the end of a 
buffer. The allocator checks for redzone violations 
by adding a guard word to the end of each buffer 
and verifying that it is unmodified when the buffer 
is freed. 
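
A minimal sketch of the guard-word technique follows. The guard value 0xfeedface and the function names are assumptions; the text does not say which value the allocator uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REDZONE_PATTERN 0xfeedfaceu  /* assumed guard value; the text names none */

/* Write the guard word just past the `size` usable bytes of the buffer. */
static void redzone_arm(unsigned char *buf, size_t size)
{
    uint32_t guard = REDZONE_PATTERN;

    assert(buf != NULL);
    memcpy(buf + size, &guard, sizeof guard);
}

/* At free time: nonzero if the guard word is unmodified. */
static int redzone_intact(const unsigned char *buf, size_t size)
{
    uint32_t guard;

    memcpy(&guard, buf + size, sizeof guard);
    return guard == REDZONE_PATTERN;
}
```

The underlying allocation must be one word larger than the size the client asked for. Writes that stay within the first `size` bytes leave the guard untouched, while a write even one byte too far is caught when the buffer is freed.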


6.5. Synchronous Unmapping 


Normally, the slab working-set algorithm retains 
complete slabs for a while. In synchronous- 
unmapping mode the allocator destroys complete 
slabs immediately. kmem_slab_destroy()
returns the underlying memory to the back-end page 
supplier, which unmaps the page(s). Any subse- 
quent reference to any object in that slab will cause 
a kernel data fault. 


6.6. Page-per-buffer Mode 


In page-per-buffer mode each buffer is given an 
entire page (or pages) so that every buffer can be 
unmapped when it is freed. The slab allocator 
implements this by increasing the alignment for all 
caches to the system page size. (This feature 
requires an obscene amount of physical memory.) 


6.7. Leak Detection 


The timestamps provided by auditing make it easy 
to implement a crude kernel memory leak detector 
at user level. All the user-level program has to do 
is periodically scan the arena (via /dev/kmem), 
looking for the appearance of new, persistent alloca- 
tions. For example, any buffer that was allocated 
an hour ago and is still allocated now is a possible 
leak. 
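
The age test at the heart of such a detector is simple. The sketch below assumes the scanner has already copied the audit records out of the kernel into an array; the `audit_rec` layout and the function name are hypothetical, and reading /dev/kmem is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of one audited buffer, as a scanner might read it. */
struct audit_rec {
    long alloc_time;   /* timestamp of the allocation, in seconds */
    int allocated;     /* nonzero if the buffer is still allocated */
};

/* Count buffers allocated at least `min_age` seconds ago and not yet freed. */
static size_t count_leak_suspects(const struct audit_rec *recs, size_t n,
                                  long now, long min_age)
{
    size_t i, suspects = 0;

    assert(min_age >= 0);
    for (i = 0; i < n; i++)
        if (recs[i].allocated && now - recs[i].alloc_time >= min_age)
            suspects++;
    return suspects;
}
```

With `min_age` set to 3600 this flags every buffer that was allocated an hour ago and is still allocated; comparing successive scans would separate persistent allocations from merely long-lived ones.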


6.8. An Example


This example illustrates the slab allocator’s response 
to modification of a free snode: 


kernel memory allocator: buffer modified after being freed
modification occurred at offset 0x18 (0xdeadbeef replaced by 0x34)
buffer=ff8eea20 bufctl=ff8efef0 cache: snode_cache

previous transactions on buffer ff8eea20:

thread=ff8b93a0 time=T-0.000089 slab=ff8ca8c0 cache: snode_cache
kmem_cache_alloc+f8
specvp+48
ufs_lookup+148
lookuppn+3ac
lookupname+28
vn_open+a4
copen+6c
syscall+3e8

thread=ff8b94c0 time=T-1.830247 slab=ff8ca8c0 cache: snode_cache
kmem_cache_free+128
spec_inactive+208
closef+94
syscall+3e8

(transaction log continues at ff31f410)
kadb[0]:

Other errors are handled similarly. These features 
have proven helpful in debugging a wide range of 
problems during SunOS 5.4 development. 


7. Future Directions 


7.1. Managing Other Types of Memory 


The slab allocator gets its pages from segkmem via 
the routines kmem_getpages() and
kmem_freepages(); it assumes nothing about
the underlying segment driver, resource maps, trans- 
lation setup, etc. Since the allocator respects this 
firewall, it would be trivial to plug in alternate 
back-end page suppliers. The ‘‘getpages’’ and 
‘‘freepages’’ routines could be supplied as addi- 
tional arguments to kmem_cache_create(). 
This would allow us to manage multiple types of 
memory (e.g. normal kernel memory, device 
memory, pageable kernel memory, NVRAM, etc.) 
with a single allocator. 
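
One way such a pluggable back end could look is sketched below. Everything here is hypothetical: the function-pointer types, the reduced `struct cache`, and the heap-backed supplier are assumptions for illustration and do not reflect the actual kmem_cache_create() interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

typedef void *(*getpages_fn)(size_t npages);
typedef void (*freepages_fn)(void *addr, size_t npages);

/* Hypothetical cache descriptor carrying its own back-end page supplier. */
struct cache {
    getpages_fn getpages;
    freepages_fn freepages;
    size_t pages_held;
};

/* One possible supplier: ordinary heap memory. */
static void *heap_getpages(size_t npages)
{
    return malloc(npages * SKETCH_PAGE_SIZE);
}

static void heap_freepages(void *addr, size_t npages)
{
    (void)npages;
    free(addr);
}

static void cache_init(struct cache *c, getpages_fn get, freepages_fn put)
{
    assert(get != NULL && put != NULL);
    c->getpages = get;
    c->freepages = put;
    c->pages_held = 0;
}

/* The cache never calls a page routine directly, only through the pointers. */
static void *cache_grow(struct cache *c, size_t npages)
{
    void *addr = c->getpages(npages);

    if (addr != NULL)
        c->pages_held += npages;
    return addr;
}

static void cache_shrink(struct cache *c, void *addr, size_t npages)
{
    c->freepages(addr, npages);
    c->pages_held -= npages;
}
```

Because the cache code reaches memory only through the two pointers, a different pair of routines (for device memory or NVRAM, say) could be substituted without touching the allocation logic.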


7.2. Per-Processor Memory Allocation 


The per-processor allocation techniques of McKen- 
ney and Slingwine [McKenney93] would fit nicely 
on top of the slab allocator. They define a four- 
layer allocation hierarchy of decreasing speed and 
locality: per-CPU, global, coalesce-to-page, and 
coalesce-to-VM-block. The latter three correspond 
closely to the slab allocator’s front-end, back-end, 
and page-supplier layers, respectively. Even in the 
absence of lock contention, small per-processor
freelists could improve performance by eliminating 
locking costs and reducing invalidation traffic. 


7.3. User-level Applications 


The slab allocator could also be used as a user-level 
memory allocator. The back-end page supplier 
could be mmap(2) or sbrk(2). 


8. Conclusions 


The slab allocator is a simple, fast, and space- 
efficient kernel memory allocator. The object-cache 
interface upon which it is based reduces the cost of 
allocating and freeing complex objects and enables 
the allocator to segregate objects by size and life- 
time distribution. Slabs take advantage of object 
size and lifetime segregation to reduce internal and 
external fragmentation, respectively. Slabs also 
simplify reclaiming by using a simple reference 
count instead of coalescing. The slab allocator 
establishes a push/pull relationship between its 
clients and the VM system, eliminating the need for 
arbitrary limits or watermarks to govern reclaiming. 
The allocator’s coloring scheme distributes buffers 
evenly throughout the cache, improving the 
system’s overall cache utilization and bus balance. 
In several important areas, the slab allocator pro- 
vides measurably better system performance. 


Acknowledgements 


Neal Nuckolls first suggested that the allocator 
should retain an object’s state between uses, as our 
old streams allocator did (it now uses the slab allo- 
cator directly). Steve Kleiman suggested using VM 
pressure to regulate reclaiming. Gordon Irlam 
pointed out the negative effects of power-of-two 
alignment on cache utilization; Adrian Cockcroft 
hypothesized that this might explain the bus imbal- 
ance we were seeing on some machines (it did). 


I'd like to thank Cathy Bonwick, Roger 
Faulkner, Steve Kleiman, Tim Marsland, Rob Pike, 
Andy Roach, Bill Shannon, and Jim Voll for their 
thoughtful comments on draft versions of this paper. 
Thanks also to David Robinson, Chaitanya Tikku, 
and Jim Voll for providing some of the measure- 
ments, and to Ashok Singhal for providing the tools 
to measure cache and bus activity. 


Most of all, I thank Cathy for putting up with 
me (and without me) during this project. 







References 


[Barrett93] David A. Barrett and Benjamin G. 
Zorn, Using Lifetime Predictors to Improve Memory 
Allocation Performance. Proceedings of the 1993 
SIGPLAN Conference on Programming Language 
Design and Implementation, pp. 187-196 (1993). 


[Boehm88] H. Boehm and M. Weiser, Garbage 
Collection in an Uncooperative Environment. 
Software - Practice and Experience, v. 18, no. 9, pp.
807-820 (1988).


[Bozman84A] G. Bozman, W. Buco, T. Daly, and 
W. Tetzlaff, Analysis of Free Storage Algorithms -- 
Revisited. IBM Systems Journal, v. 23, no. 1, pp. 
44-64 (1984). 


[Bozman84B] G. Bozman, The Software Lookaside 
Buffer Reduces Search Overhead with Linked Lists. 
Communications of the ACM, v. 27, no. 3, pp. 
222-227 (1984).


[Cekleov92] Michel Cekleov, Jean-Marc Frailong 
and Pradeep Sindhu, Sun-4D Architecture. Revision 
1.4, 1992. 


[Chen93] J. Bradley Chen and Brian N. Bershad, 
The Impact of Operating System Structure on 
Memory System Performance. Proceedings of the 
Fourteenth ACM Symposium on Operating Systems 
Principles, v. 27, no. 5, pp. 120-133 (1993). 


[Grunwald93A] Dirk Grunwald and Benjamin 
Zorn, CustoMalloc: Efficient Synthesized Memory
Allocators. Software - Practice and Experience, v. 
23, no. 8, pp. 851-869 (1993). 


[Grunwald93B] Dirk Grunwald, Benjamin Zorn 
and Robert Henderson, Improving the Cache Locality
of Memory Allocation. Proceedings of the 1993
SIGPLAN Conference on Programming Language 
Design and Implementation, pp. 177-186 (1993). 


[Hanson90] David R. Hanson, Fast Allocation and 
Deallocation of Memory Based on Object Lifetimes. 
Software - Practice and Experience, v. 20, no. 1, pp. 
5-12 (1990). 


[Knuth68] Donald E. Knuth, The Art of Computer 
Programming, Vol I, Fundamental Algorithms. 
Addison-Wesley, Reading, MA, 1968. 


[Korn85] David G. Korn and Kiem-Phong Vo, In
Search of a Better Malloc. Proceedings of the 
Summer 1985 Usenix Conference, pp. 489-506. 


[Lee89] T. Paul Lee and R. E. Barkley, A 
Watermark-based Lazy Buddy System for Kernel 
Memory Allocation. Proceedings of the Summer 
1989 Usenix Conference, pp. 1-13. 


[Leverett82] B. W. Leverett and P. G. Hibbard, An 
Adaptive System for Dynamic Storage Allocation. 
Software - Practice and Experience, v. 12, no. 3, pp. 
543-555 (1982). 


[Margolin71] B. Margolin, R. Parmelee, and M. 
Schatzoff, Analysis of Free Storage Algorithms. 
IBM Systems Journal, v. 10, no. 4, pp. 283-304 
(1971). 


[McKenney93] Paul E. McKenney and Jack 
Slingwine, Efficient Kernel Memory Allocation on 
Shared-Memory Multiprocessors. Proceedings of 
the Winter 1993 Usenix Conference, pp. 295-305. 


[McKusick88] Marshall Kirk McKusick and 
Michael J. Karels, Design of a General Purpose 
Memory Allocator for the 4.3BSD UNIX Kernel. 
Proceedings of the Summer 1988 Usenix Confer- 
ence, pp. 295-303. 


[Oldehoeft85] Rodney R. Oldehoeft and Stephen J. 
Allan, Adaptive Exact-Fit Storage Management. 
Communications of the ACM, v. 28, pp. 506-511 
(1985). 


[Standish80] Thomas Standish, Data Structure 
Techniques. Addison-Wesley, Reading, MA, 1980. 


[Stephenson83] C. J. Stephenson, Fast Fits: New 
Methods for Dynamic Storage Allocation. Proceed- 
ings of the Ninth ACM Symposium on Operating 
Systems Principles, v. 17, no. 5, pp. 30-32 (1983). 


[VanSciver88] James Van Sciver and Richard F. 
Rashid, Zone Garbage Collection. Proceedings of 
the Summer 1990 Usenix Mach Workshop, pp. 1-5.


[Weinstock88] Charles B. Weinstock and William 
A. Wulf, QuickFit: An Efficient Algorithm for Heap 
Storage Allocation. ACM SIGPLAN Notices, v. 
23, no. 10, pp. 141-144 (1988). 


[Zorn93] Benjamin Zorn, The Measured Cost of 
Conservative Garbage Collection. Software - Prac- 
tice and Experience, v. 23, no. 7, pp. 733-756 
(1993). 


Author Information 


Jeff Bonwick is a kernel hacker at Sun. He likes to 
rip out big, slow, old code and replace it with small, 
fast, new code. He still can’t believe he gets paid 
for this. The author received a B.S. in Mathematics 
from the University of Delaware (1987) and an 
M.S. in Statistics from Stanford (1990). He can be 
flamed electronically at bonwick@eng.sun.com. 




A Better Update Policy 
Abstract


Some file systems can delay writing modified 
data to disk, in order to reduce disk traffic and over- 
head. Prudence dictates that such delays be bounded, 
in case the system crashes. We refer to an algorithm 
used to decide when to write delayed data back to 
disk as an update policy. Traditional UNIX® systems 
use a periodic update policy, writing back all 
delayed-write data once every 30 seconds. Periodic 
update is easy to implement but performs quite badly 
in some cases. This paper describes an approximate 
implementation of an interval periodic update policy, 
in which each individual delayed-write block is writ- 
ten when its age reaches a threshold. Interval peri- 
odic update adds little code to the kernel and can 
perform much better than periodic update. In par- 
ticular, interval periodic update can avoid the huge 
variances in read response time caused by using peri- 
odic update with a large buffer cache. 


1. Introduction 

File systems usually cache data and meta-data in a
main memory buffer cache, in order to improve per- 
formance. When a modification is made, the file sys- 
tem may write the new information to stable storage 
(e.g., disk) immediately, or it may delay the write. 
This leads to a tradeoff: delaying writes reduces the 
load on the disk and system overhead, but the data 
could be lost if the system crashes before the write 
occurs. In many cases, users can tolerate this vul- 
nerability, and welcome the performance advantages 
of delayed writes. 


UNIX® systems have traditionally supported 
delayed writes, from the earliest C-language 
version [13,14] up through 4.3BSD[6] and its 
derivatives. In these systems, a modification of a 
partially-filled data block results in a delayed write,
while a modification that fills a block results in an 
immediate, although asynchronous, write. The 
ULTRIX™ operating system can be configured to be 
even more aggressive, delaying all writes of modified 
data. 


Without some bound on the age of a delayed- 
write block, a system crash could cause loss of ar- 
bitrary data. Users would not tolerate this, so the file 
system does push delayed-write data out to disk, after 
a while. We use the term update policy to describe 
the algorithm that decides what to write out, and 
when. 


UNIX systems have traditionally used a simple 
periodic update (or ‘‘PU’’) policy: once every 30 
seconds, all dirty blocks in the file system’s buffer 
cache are placed on the output queue for the ap- 
propriate disk. Recent analytical and simulation 
results, presented by Carson and Setia [2], showed 
that the PU policy actually performs worse in many 
cases than the write-through (WT) policy (in which 
all writes are immediate). Their analysis showed that 
PU causes increased mean response times for read 
operations; the results presented in this paper show 
that PU can also increase the variance in read 
response time. 


Because so many systems use the now suspect PU 
policy, it seemed like a good idea to validate the 
results of Carson and Setia on actual systems; their 
analyses and simulations, while careful, had to use 
certain simplifying assumptions. Carson and Setia 
suggested that several other update policies would 
provide better performance, so it also seemed useful 
to implement one of these and measure its perfor- 
mance. 


In this paper, after discussing the theoretical back- 
ground in some more detail, I describe an implementation
of the interval periodic update (IPU) policy. I
also present the results of some simple measurements 
of PU and IPU made using an actual implementation, 
rather than a model. The results appear to bear out 
the basic conclusions of Carson and Setia. I found
that although delayed writes can improve
mean response time, combining them
with periodic update increases the variance in
response time, whereas interval periodic update
improves both the mean and the variance.
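
The difference between the two policies can be made concrete with a small sketch. This is an idealized illustration, not the implementation described in this paper (which is approximate): the `struct block` layout, the function names, and the use of whole-second timestamps are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer-cache block state. */
struct block {
    int dirty;          /* nonzero if the block holds delayed-write data */
    long dirtied_at;    /* time (seconds) the block first became dirty */
};

/* Periodic update: when the period expires, queue every dirty block at once. */
static size_t pu_flush(struct block *blk, size_t n, long now, long period)
{
    size_t i, queued = 0;

    assert(period > 0);
    if (now % period != 0)
        return 0;
    for (i = 0; i < n; i++)
        if (blk[i].dirty) {
            blk[i].dirty = 0;
            queued++;
        }
    return queued;
}

/* Interval periodic update: queue a block only when its own age reaches the threshold. */
static size_t ipu_flush(struct block *blk, size_t n, long now, long threshold)
{
    size_t i, queued = 0;

    assert(threshold > 0);
    for (i = 0; i < n; i++)
        if (blk[i].dirty && now - blk[i].dirtied_at >= threshold) {
            blk[i].dirty = 0;
            queued++;
        }
    return queued;
}
```

Given three blocks dirtied at times 0, 10, and 20 and a 30-second bound, `pu_flush` queues all three at time 30, while `ipu_flush` queues one each at times 30, 40, and 50. The total write traffic is the same; only the burstiness differs.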


2. Theoretical background 

In this section, I discuss the previous simulation 
study, the possible alternative update policies, the 
choice of an update interval, and how this problem 
will scale with changes in technology. 


2.1. Carson and Setia’s results for Periodic 
Update 

The study done by Carson and Setia showed that 
the periodic update (PU) policy, while easily im- 
plemented, can perform quite badly. By dumping all 
the dirty blocks into the disk queue at once, PU can 
cause lengthy queueing delays. Latency-sensitive 
synchronous operations, such as file reads or 
synchronous writes, are forced to wait behind 
latency-insensitive asynchronous operations in the 
queue. If the system were to use a write-through 
(WT) policy instead, disk write operations would 
normally be spread out more over time, and the 
queues would be shorter. 

Carson and Setia show that the relative perfor- 
mance of WT and PU, measured in terms of mean 
read-response time, depends on several parameters: 
Read load 

The ratio of read operation arrival rate to the rate 

that the disk can support. A read load of 1.0 is 

one that the disk could just barely keep up with, if 
no writes were done. 


Write load 
The ratio of write operation arrival rate to the rate 
that the disk can support. 


Cache hit ratio for writes 
The fraction of write operations that are satisfied 
by modifying already-dirty blocks in the buffer 
cache.

They expressed their results in tables showing, for a
given read load and write load, what write-hit ratio 
PU requires in order to match or exceed the perfor- 
mance of WT. (Since WT causes disk writes to occur 
almost immediately, it cannot benefit from write hits 
in the cache.) 


They found that: 

• When the disk is not overloaded, “the cache
must eliminate 80-90% of all write accesses
before the PU policy pays off.”


• Under heavy loads, the PU policy gets some
benefit from write-hits in the cache, and thus 
reduces the overall disk load. In cases where 
the WT policy would saturate the disk, the PU 
policy gives a lower mean read-response time. 

Cache hit rates vary, depending on cache size, ap- 
plication, and replacement policy, but we are unlikely 
to achieve average write-hit rates exceeding 80%. 
For example, Baker et al. [1] traced user-level file
access patterns in a distributed system and found that 
88% of the bytes written were written sequentially; if 
the applications in question used traditional buffering 
strategies, most of these sequential writes would not 
have hit already-dirty buffer-cache blocks, and so the 
write-hit rate must have been quite low. 


Carson and Setia found an analytical model for 
the mean response time. Their simulations were used 
to validate this result, but they apparently did not 
investigate other statistics besides the mean. As sec- 
tion 4 will show, PU is especially bad for worst-case 
response time, and for overall variance in response 
time. It is not hard to construct a situation where PU 
can lead to worst-case read response times of many 
seconds. 


2.2. Proposed alternative policies 

PU performs poorly because it generates long 
queues at periodic intervals, and subsequent 
synchronous requests get stuck at the ends of these 
queues. Perhaps one could improve read-response 
times by changing the queueing mechanism. 

UNIX systems typically maintain a single, un- 
prioritized operation queue for each disk. Suppose 
that read operations were given priority over 
asynchronous operations already in the queue. Then, 
read operations would not ‘‘see’’ a queueing delay 
caused by the queued delayed writes. Carson and 
Setia analyzed this periodic update with read priority 
(PURP) policy, and showed that it mostly solves the 
read-response time problem. I found PURP unsatis- 
factory, however, for several reasons: 

• Modification of the existing disk queue
mechanism would require changes to numerous 
kernel modules, including all disk drivers and 
many of their clients (file systems, virtual 
memory systems, etc.) 


• Modern disk controllers and drives can queue
several operations in their internal buffers. One
would either have to accept the resulting queue-
ing delays, or somehow modify the hardware to
support the new queueing mechanism.


• Carson and Setia point out that fixed-priority
schemes such as PURP introduce the potential
for infinite delays of delayed writes, if the read
load is enough to saturate the disk.

Peacock [12] reported that adding PURP to System
V Release 4 does seem to hurt benchmark
performance, although it substantially increases 
single-file write throughput. 

Although implementation of a prioritized queueing 
scheme should be helpful in general, it is neither a 
complete solution to the bursty-update problem, nor 
is it the simplest solution. 


Carson and Setia also proposed the interval peri- 
odic update (IPU) policy, in which each dirty block is 
written out when its age reaches a threshold. If file 
modifications are nicely spread out in time, this 
means that the delayed writes back to the disk will 
also be spread out. As with PU and PURP, IPU uses 
the buffer cache to eliminate some disk writes that 
would be done by WT. Unlike PU and PURP, IPU 
normally avoids creating large bursts of writes, and 
so avoids the associated queueing delays. Carson and 
Setia show that IPU never gives worse mean read 
response time than WT or PU, although in some 
situations it may perform worse than PURP. 


Anna Hac [3] describes algorithms meant for 
deciding when to move dirty blocks from the buffer 
cache to a disk queue. In essence these replace the 
time-driven update policies with dynamic algorithms, 
which choose when to schedule disk writes based on 
the system load and disk queue length. Such adap- 
tive algorithms may perform better than any of the 
open-loop algorithms described in this paper, but 
they require more extensive changes to the operating 
system. I do not have anything useful to say about
them, but they merit additional study.


2.3. Choice of update interval 

UNIX systems have traditionally used a 30-second 
interval between writes generated by the PU policy. 
This means that, ignoring brief queueing delays, no 
information will be vulnerable to a crash for longer 
than 30 seconds. (Applications that depend on reli- 
able data storage should arrange to write their data 
synchronously, using the fsync() system call. Many 
other applications, such as compilers, can afford to 
use delayed writes because their output can easily be 
reconstructed, or because if the system crashes during 
a run, the resulting partial output is useless anyway.) 


Under the assumption that modifications occur 
more or less uniformly over time, the average age of 
a delayed-write block, when it is written to disk, is 15 
seconds. 


The IPU policy also has a characteristic time 
scale, the age at which a dirty buffer is scheduled for 
writing to the disk. If we set this to 30 seconds, then 
(again ignoring queueing delays), by definition, no 
information will be vulnerable to a crash for longer 
than 30 seconds. Also by definition, the average age 
of a block, when written to disk, is 30 seconds. 


If we choose to set the update delay for IPU the 
same as the update interval for PU, then both policies 
expose modified data to exactly the same worst-case 
vulnerability. Doing so, however, means that the 
mean age of dirty blocks is roughly twice what it would
be with the PU policy. This suggests that IPU should 
see a higher write-hit ratio, and might avoid a few 
more disk write operations. 
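
The age arithmetic above can be checked with a short sketch. It assumes, as the text does, that blocks are dirtied uniformly over time and that queueing delays are ignored. The function names and the sample arrival pattern are illustrative only and do not come from any kernel.

```python
def pu_age_at_write(dirty_time, interval=30.0):
    """Age of a block when PU flushes it at the next multiple of `interval`."""
    next_flush = (dirty_time // interval + 1) * interval
    return next_flush - dirty_time


def ipu_age_at_write(dirty_time, threshold=30.0):
    """Under pure IPU every block is written exactly when it reaches the threshold."""
    return threshold


# One block dirtied per second for ten minutes, offset by half a second
# so that the samples sit in the middle of each one-second slot.
dirty_times = [i + 0.5 for i in range(600)]

pu_ages = [pu_age_at_write(t) for t in dirty_times]
ipu_ages = [ipu_age_at_write(t) for t in dirty_times]

pu_mean = sum(pu_ages) / len(pu_ages)
ipu_mean = sum(ipu_ages) / len(ipu_ages)

print(f"PU : mean age {pu_mean:.1f} s, worst case below {30.0:.0f} s")
print(f"IPU: mean age {ipu_mean:.1f} s, worst case {max(ipu_ages):.0f} s")
```

Both policies bound the vulnerable period by the same 30 seconds, but the mean age at write-out differs by a factor of two, which is the basis for expecting a somewhat higher write-hit ratio under IPU.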


Carson and Setia showed that, as the update inter- 
val was increased, the write-hit ratio at which PU 
began to pay off had to increase as well. We do
expect the actual write-hit ratio to rise with a longer interval, but does it rise fast
enough? Carson and Setia cite other work suggesting 
that it might [10] (see also a more recent study [1]). 
Unfortunately, I know of no actual test of this 
hypothesis. Still, one might suspect that the in- 
creased average lifetime of dirty blocks, when the 
IPU policy is used, might account for some perfor- 
mance advantage. (Note that the experiments 
reported in sections 4.1 and 4.2 carefully avoid 
repeated writes to the same block during an update 
interval, and so should encounter abnormally low 
cache hit ratios.) 


2.4. Scaling properties

One might ask why, if PU performs so badly, this
has not been a problem in practice.¹ The answer is
that buffer cache sizes and disk speeds are improving 
at different rates, which changes the ratio of disk 
queue length to disk service latency. 


4.2BSD and related systems only delay partial- 
block writes. Since most files are written sequen- 
tially, most blocks are filled quickly, and pending 
delayed writes of partial blocks are turned into 
asynchronous immediate writes of filled blocks. If 
the buffer cache is not large enough to hold many 
entire files, then it makes little sense to delay writes
of full blocks, since these blocks are unlikely to be
referenced again quickly.


¹ Some systems have indeed exhibited poor behavior
resulting from disk queues containing many asynchronous
write requests [12].


Memory chips get larger: this is one of the most 
reliable laws of recent history. One can quibble over 
whether the doubling time is 18 months or two years, 
but main memory sizes do increase (at roughly con- 
stant cost) as the years pass, and no other technology 
trend is quite so steep [5]. 


This trend means that, if the fraction of main 
memory used as a buffer cache remains constant, the 
absolute size of buffer caches is increasing with time. 
(Many systems, including Mach, Sprite, and recent 
UNIX implementations, no longer allocate a fixed 
fraction of main memory for the buffer cache, so it 
can grow to fill all of memory.) Since mean file sizes 
do not seem to be increasing as rapidly [1], perhaps 
as main memories get larger, using delayed writes 
would increasingly reduce disk traffic because of 
write-hits in the cache. (Traces do show that a few 
large files are getting much larger [1], and so caching 
algorithms should perhaps switch to write-through 
for any file larger than a certain size.) 


Although increasing DRAM densities lead to 
larger buffer caches and perhaps more use of delayed 
writes, disk technology trends are less encouraging. 
Disk access times have improved by perhaps one- 
third in ten years. Disk densities are increasing more 
rapidly, doubling every three years [5]. Disk
bandwidths tend to scale as the square root of disk 
density (since density improvements come from both 
higher signal rates and smaller track spacings), and 
also benefit from small increases in rotation rate 
(from 3600 RPM to 5400 RPM), so over the past 
decade they have improved by perhaps a factor of
six.


This means that the time it takes to write the en- 
tire buffer cache to disk is growing, in absolute terms. 
This is the key problem for the PU update policy: the 
queueing delays caused by its burst of write requests 
will get worse in the future. 

Delayed writes typically are queued in no par- 
ticular order. If the disk driver does nothing to 
schedule the write operations more carefully, the rate 
at which the queue can be drained depends mostly on 
the disk’s average access time. In table 2-1, I show 
how long it takes to write the entire buffer cache 
(assuming this is 10% of main memory) for systems 
typical of 1983 and 1993, and I rashly project current 
trends 10 years into the future. 
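
The calculation behind this kind of projection is simple enough to sketch. The cache is taken to be 10% of main memory, as stated above, and every queued write is charged one average disk access. The 30 msec access time used in the example is a hypothetical figure chosen for illustration, not a value from the paper's table.

```python
def cache_blocks(memory_bytes, block_bytes, cache_fraction=0.10):
    """Number of buffer-cache blocks when the cache is a fixed fraction of memory."""
    return round(memory_bytes * cache_fraction / block_bytes)


def random_write_time(blocks, access_time_s):
    """Seconds to drain the queue if each write costs one average access."""
    return blocks * access_time_s


MB = 1024 * 1024

# A 1 MB machine with 512-byte blocks.
blocks = cache_blocks(1 * MB, 512)
print("cache blocks:", blocks)

# Hypothetical drive with a 30 msec average access time (an assumption).
print("seconds to flush:", random_write_time(blocks, access_time_s=0.030))
```

Because the block count grows with memory size while access time improves only slowly, the product grows from one hardware generation to the next.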


Year | Memory size | Block size | Cache size (blocks)
1983 | 1 MB        | 512 B      | 205
(remaining rows and columns not legible)

Table 2-1: Scaling for random write of entire buffer cache


If the disk drive system can optimize the order of
writes in the queue, in the best case the queue can be
drained at full disk bandwidth. (Many UNIX disk
drivers do sort requests to avoid seeks [6], and some
modern disk drives themselves reorder requests.) In
table 2-2, I show how the delays for this process scale
over time. These numbers are much better than those
in table 2-1, but they are probably unattainable, and
in any case they are also getting worse.


Table 2-2: Scaling for sequential write of entire buffer cache


3. Implementation

In this section, I discuss the implementation of
various update policies, including the original UNIX
implementation of PU, my approximate implementation
of IPU, and Sprite’s implementation of a
similar policy.


3.1. 4.2BSD implementation of the PU policy

Before describing how I implemented the IPU
policy, I will describe the ULTRIX implementation of
the PU policy. My code is a simple modification of
the ULTRIX implementation.


Every 30 seconds, a daemon process
(/etc/update) wakes up and does a sync() system
call. This system call schedules writes for certain file
system meta-data (the superblock, for example), and
then calls the bflush() routine to update delayed
writes.


In the original 4.2BSD implementation, bflush()
traversed a queue containing all of the valid blocks in
the buffer cache, and scheduled an immediate write
for each delayed-write block. Once a block was written
and removed from the queue, the algorithm
started again from the beginning of the queue; this
was done because the queue could be manipulated by
another process while bflush() was waiting for the
write to complete. Thus, in the worst case this required
time proportional almost to the square of the
size of the buffer cache, and as memories grew
larger, the /etc/update process started to consume
a large fraction of the CPU.


Recent versions of ULTRIX solve this problem by 
keeping a separate list of delayed-write blocks (i.e., 
blocks that are dirty and have not yet been queued for 
the disk). The bflush() routine simply traverses this 
list once; it need not examine clean blocks, nor does 
it have to examine any block more than once. 
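
The difference between the two traversal strategies can be made concrete with a toy model. The sketch below is not kernel code: the buffer cache is reduced to a list of dirty flags, "writing" a block just clears its flag, and the function names are illustrative. It only counts how many buffers each strategy has to examine.

```python
def flush_with_restart(cache):
    """Scan all valid buffers and restart from the head after every write.

    `cache` is a list of booleans (True means delayed-write). Returns the
    number of buffers examined before a full pass finds nothing to write.
    """
    examined = 0
    found_dirty = True
    while found_dirty:
        found_dirty = False
        for index, dirty in enumerate(cache):
            examined += 1
            if dirty:
                cache[index] = False   # stands in for scheduling the write
                found_dirty = True     # the queue may have changed: start over
                break
    return examined


def flush_dirty_list(dirty_list):
    """Walk a separately maintained list of dirty buffers exactly once."""
    examined = 0
    while dirty_list:
        dirty_list.pop()               # take one dirty buffer, schedule its write
        examined += 1
    return examined


# 1000 buffers, with the dirty half at the far end of the queue.
n = 1000
cache = [False] * (n // 2) + [True] * (n // 2)
dirty_list = [i for i, dirty in enumerate(cache) if dirty]

restart_cost = flush_with_restart(cache)
list_cost = flush_dirty_list(dirty_list)
print("restart scan examined:", restart_cost)
print("dirty-list scan examined:", list_cost)
```

In this arrangement the restarting scan examines hundreds of buffers for every one it writes, while the separate list touches each dirty buffer once and never looks at a clean one.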


3.2. Implementation of the IPU policy 

Carson and Setia point out that to implement a 
pure IPU policy would require a somewhat complex 
timer mechanism. Since the timers in the UNIX ker- 
nel are quantized, if one wants to issue write opera- 
tions in the same order as the blocks are dirtied, then 
one also needs to maintain a separate ordered queue. 
The overhead of the queue and timers may not be 
onerous, but it does complicate the kernel. 


But why implement pure IPU? If a practical im- 
plementation must use quantized time, why not use a 
relatively coarse grain? If the algorithm that moves
delayed-write blocks onto the disk queues runs, say, 
once per second, then the queues will receive bursts 
of writes, but the bursts will (on average) be about 
3% as large as they would be with the PU policy and 
a 30-second update interval. There will also be a 
1-second quantization error in the maximum vul-
nerable period for a dirty block, but who cares?


I chose to implement this kind of approximate 
IPU (or ‘‘AIPU’’). I created an alternative version of 
the sync() system call, smoothsync(), that takes as its 
parameter the age at which a dirty block should be 
written to disk.² The smoothsync() system call in-
vokes a modified version of the bflush() routine,
called bflush_smooth(). The main difference is that 
bflush_smooth() only schedules a buffer for writing if 
it has been dirty for longer than the specified 
threshold. 


I replaced the usual /etc/update program, 
which simply calls sync() once every thirty seconds, 
with one that calls smoothsync(30) once a second. 
This program also forces the file system meta-data to 
disk once every 30 seconds. 
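
The control flow of such a replacement update program might look like the sketch below. The three callables are stand-ins supplied by the caller: in a real system they would wrap the smoothsync() request, the meta-data request, and a sleep call, none of which are modeled here. The parameter names are assumptions made for the example.

```python
def run_update(smoothsync, sync_metadata, sleep, ticks,
               age_threshold=30, metadata_period=30):
    """Call smoothsync() every second and flush meta-data every 30 seconds.

    smoothsync(age)  -- stand-in: write blocks that have been dirty `age` seconds
    sync_metadata()  -- stand-in: force file system meta-data to disk
    sleep(seconds)   -- injected so the loop can be exercised without waiting
    ticks            -- number of one-second iterations to run
    """
    for tick in range(1, ticks + 1):
        sleep(1)
        smoothsync(age_threshold)
        if tick % metadata_period == 0:
            sync_metadata()


# Dry run with stand-ins that only record what was called.
calls = []
run_update(smoothsync=lambda age: calls.append(("smoothsync", age)),
           sync_metadata=lambda: calls.append(("metadata",)),
           sleep=lambda seconds: None,
           ticks=90)
print(len(calls), "calls recorded in 90 simulated seconds")
```

Keeping the meta-data flush on its original 30-second schedule means that only the handling of delayed data blocks changes.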


The system must also record the time at which a
buffer becomes dirty, using a timestamp field in the
header associated with each buffer. I did this by
adding a few lines of code to a routine called brelse(),
which is the only place where a buffer is placed on
the delayed-write list. At this point, if the timestamp
field is zero, then it is set to the current time; other-
wise, it is left alone. The brelse() routine is also the
only place where buffers are placed on the list of
clean buffers; at this point, the timestamp field is set
to zero.


² Actually, instead of creating a true system call, I added
an ioctl request, since this involved writing less code. The
net effect should be identical. I added one more ioctl, to
write file system meta-data to disk; this can be called once
every 30 seconds, to preserve existing sync() semantics.


Thus, a clean buffer always has a zero timestamp. 
A dirty buffer always has a timestamp reflecting 
when it was first dirtied. Further modifications of a 
dirty block do not update the timestamp; otherwise, a 
block that was touched more often than once every 
30 seconds would never be written to the disk. 
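
The timestamp rule is small enough to state as code. The sketch below models only the bookkeeping described above. The Buf class, the is_due() helper, and the use of integer seconds are assumptions made for illustration, and the clock is assumed never to read zero, since zero is reserved to mean "clean".

```python
class Buf:
    """Minimal stand-in for a buffer header."""

    def __init__(self):
        self.dirty = False
        self.timestamp = 0      # zero means "clean"


def brelse(buf, now):
    """Release a buffer onto the delayed-write list or the clean list."""
    if buf.dirty:
        if buf.timestamp == 0:
            buf.timestamp = now   # first dirtied: remember when
        # Already dirty: keep the original stamp, so a frequently
        # modified block still ages and is eventually written.
    else:
        buf.timestamp = 0         # back on the clean list


def is_due(buf, now, age_threshold=30):
    """True once a dirty buffer's age has reached the threshold."""
    return buf.timestamp != 0 and now - buf.timestamp >= age_threshold


b = Buf()
b.dirty = True
brelse(b, now=100)      # first modification
brelse(b, now=120)      # modified again while still dirty
print(b.timestamp, is_due(b, now=129), is_due(b, now=130))
```

Had the second brelse() call refreshed the stamp, the block would not have come due until time 150, and a block touched every few seconds would never come due at all.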


ULTRIX already includes a timestamp field, 
busy time, in the buffer header. This is used only 
when a buffer is busy (i.e., on a disk operation 
queue), and so is never used when a buffer is on the 
delayed-write list. Therefore, it can be “time-
shared” between these two uses. Other operating
systems, including 4.xBSD, do not have such a times- 
tamp field, and so the implementation of IPU would 
require its addition. The space overhead is small; 
using modular arithmetic, a one-byte field would al- 
low maximum ages under 255 seconds. 
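
One way a one-byte field could work is sketched below. The encoding is an assumption, not something the paper specifies: zero is kept as the "clean" marker, which leaves 255 usable values, so the clock is reduced modulo 255 and shifted up by one. Ages then come out correctly as long as the true age stays under 255 seconds.

```python
MODULUS = 255   # 256 byte values minus the reserved zero


def encode_stamp(now):
    """Store the clock (in seconds) as a byte in 1..255; 0 stays free for 'clean'."""
    return now % MODULUS + 1


def age_from_stamp(now, stamp):
    """Recover the age in seconds; valid only while the true age is below 255."""
    return (now - (stamp - 1)) % MODULUS


stamp = encode_stamp(1000)
print(stamp, age_from_stamp(1030, stamp))

# The subtraction still works when the clock has wrapped past the modulus.
wrapped = encode_stamp(254)
print(wrapped, age_from_stamp(284, wrapped))
```

With a 30-second threshold and a one-second scan, no dirty block should ever get close to the 255-second limit, so the wraparound ambiguity would not arise in practice.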


This modification re-introduces the possibility of 
N² behavior in the worst case (that is, when about
half of the buffer cache is due to be written during a 
single interval). I solved this by using an extra 
queue. The bflush_smooth() routine starts by travers- 
ing the delayed-write queue and moving ready-to- 
write buffers from there onto a pending-write queue; 
this can be done without blocking, and in time linear 
in the size of the buffer cache. In the second phase, 
bflush_smooth() writes the blocks on the pending- 
write queue. It could block during this phase, but 
because it simply pulls the first entry off of the 
pending-write queue, the algorithm is linear in the 
number of ready-to-write blocks, and so is also linear 
in the size of the buffer cache, even in the worst case. 
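
The two-phase structure can be sketched as follows. This is a model of the idea rather than the ULTRIX code: buffers are (block id, first-dirty time) pairs, the queues are Python deques, and write_block is a caller-supplied stand-in for the operation that may block.

```python
from collections import deque


def bflush_smooth(delayed_write, now, age_threshold, write_block):
    """Write every buffer whose age has reached the threshold.

    delayed_write -- deque of (block_id, first_dirty_time), modified in place
    write_block   -- callable that performs the (possibly blocking) write
    Returns the number of blocks written.
    """
    # Phase 1: one non-blocking pass that moves ready buffers aside.
    pending = deque()
    still_young = deque()
    while delayed_write:
        block_id, first_dirty = delayed_write.popleft()
        if now - first_dirty >= age_threshold:
            pending.append(block_id)
        else:
            still_young.append((block_id, first_dirty))
    delayed_write.extend(still_young)

    # Phase 2: may block, but only ever takes the head of the private
    # pending queue, so nothing is rescanned after a write completes.
    written = 0
    while pending:
        write_block(pending.popleft())
        written += 1
    return written


queue = deque([(1, 0), (2, 10), (3, 25)])
sent = []
count = bflush_smooth(queue, now=40, age_threshold=30, write_block=sent.append)
print(count, sent, list(queue))
```

Because other processes can only change the delayed-write queue, not the private pending queue, the second phase has no reason to restart, and the total work stays linear.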


3.3. Sprite’s implementation of an IPU policy 

The Sprite operating system [9] implements an 
approximate IPU policy, although somewhat dif- 
ferent from the one I implemented. Sprite keeps 
track of the first-dirty time for the oldest dirty block 
of each file. Every five seconds, it scans all the dirty 
files in its cache, and if a file’s oldest dirty block is 
more than 30 seconds old, all of the file’s dirty blocks 
are written back [4]. Because this policy can in 
theory cause write-backs of fairly young blocks, it 
may perform somewhat differently from IPU or
AIPU. However, since most files are open for only
brief periods [1], and so normally all writes to a file 
happen more or less simultaneously, the average 
lifetime of a dirty block should be close to 30 
seconds. 
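
The per-file rule described above can be modeled in a few lines. This is a sketch of the policy as summarized in this section, not of Sprite's code. The dictionary layout and the function name are hypothetical, and the caller is assumed to invoke the scan every five seconds.

```python
def sprite_scan(files, now, age_threshold=30):
    """Write back all dirty blocks of any file whose oldest one is too old.

    files -- dict mapping a file name to the list of first-dirty times of
             its dirty blocks; flushed files have their list emptied.
    Returns a dict mapping each flushed file to the number of blocks written.
    """
    written = {}
    for name, dirty_times in files.items():
        if dirty_times and now - min(dirty_times) > age_threshold:
            written[name] = len(dirty_times)
            dirty_times.clear()
    return written


files = {"a": [0, 28], "b": [10]}
flushed = sprite_scan(files, now=35)
print(flushed, files)
```

File "a" is flushed in full, including a block that has been dirty for only 7 seconds, which is the "fairly young blocks" case mentioned above.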


4. Results 

In this section, I describe some simple measure- 
ments comparing my implementation of IPU against 
the original PU policy. All of these measurements 
were done using a modified ULTRIX version 4.3 ker- 
nel, running on either of two DECstation systems; the 
hardware is summarized in table 4-1. The SCSI disk 
drives used apparently do not reorder requests, and 
the ULTRIX version 4.3 SCSI device driver does not 
sort requests before issuing them. 


In all of the experiments in sections 4.1 and 4.2,
two processes ran simultaneously:

• A write-load generator, configured to dirty half
the blocks in the buffer cache every 30 seconds, 
at a rate set so that no block would be touched 
twice in one minute. This means that none of 
the writes would hit a dirty block in the cache. 


• A read-load generator, which read 10,000 ran-
domly chosen blocks from a large file, measured
the read-response time for each block, and 
generated a histogram of the delays. 

On the faster system, the buffer cache held 1228 
blocks, and the read-load generator used a 34 Mbyte 
file. On the slower system, the buffer cache held 614 
blocks, and the read-load generator used a 32 Mbyte 
file. Both generators used files stored on the
same disk. 


I varied the system configuration, enabling or dis- 
abling the use of delayed writes for full data blocks, 
and changing the update policy. Six trials were done 
for each configuration. I did not measure a pure 
write-through configuration, since I do not expect 
any modern system to use pure WT, given its known 
poor behavior. 


4.1. Local tests 

The first set of tests shows how the update policy
and use of delayed writes affect response time for
reads when the disk is local to the generating host.
Figures 4-1 and 4-2 show the results for the faster 
and slower systems, respectively. 


The figures show a point for each trial, plotted 
with mean read response time on the horizontal axis, 
and the standard deviation of read response time on 
the vertical axis. Open squares show results for the
default configuration (asynchronous writes, PU
policy): moderately high mean response time, but
low variance. When I enabled delayed writes without 
changing the update policy, the mean response time 
dropped somewhat, but the variance increased 
tremendously (open circles). I then switched to the 
AIPU policy (filled circles), which reduced the 
variance without markedly increasing the mean. One 
could also use the AIPU policy without delayed 
writes (filled squares); this results in about the same 
mean as the default configuration, but slightly less 
variance. 


At first, it seemed strange that the mean read 
response time is lower for delayed writes, since the 
write-load generator was constructed to avoid dirty- 
block cache hits; that is, the total number of write 
operations should be the same in either case. 
However, the generator does its writes sequentially, 
which means that when a group of delayed writes is 
sent to the disk, they are likely to be directed at 
nearby disk locations, and many end up in the same 
cylinder group. This reduces the average number of 
disk seeks per write (relative to non-delayed writes), 
and so reduces the load on the disk. Since the read- 
load generator issues random-access reads as fast as 
possible, disk seeks are probably the rate-limiting 
bottleneck, and a reduction in the number of write- 
related seeks leaves more disk-seek capacity to be 
used for reads. Also, issuing multiple writes to the 
same region of the disk may reduce rotational 
latencies. 


For example, figure 4-1 shows that without 
delayed writes, the system can support about 32 
reads/sec., and the write-load generator in this case 
issues about 20 writes/sec., for a total load of about 
52 disk operations/sec. Based on the average access 
time shown in table 4-1, the disk drive should support 
about 56 random-access operations/sec, which cor- 
responds closely. With delayed writes, the read rate 
increases to about 38 reads/sec., which means that the 
disk should only be doing about 14-17 random 
writes/sec. This suggests that some fraction of the 20 
blocks written are being combined with neighbors. 
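
The budget arithmetic in this paragraph can be reproduced directly. The sketch uses only figures quoted in the text (18 msec average access time, 32 and 38 reads/sec, 20 writes/sec); the variable names are illustrative.

```python
access_time_s = 0.018                  # RZ58 average access time
capacity = 1 / access_time_s           # random operations per second

reads_async, writes_issued = 32, 20    # without delayed writes
reads_delayed = 38                     # with delayed writes

total_async = reads_async + writes_issued

# Random-write budget left once the higher read rate is served, bracketed
# by the measured total load and the nominal disk capacity.
budget_low = total_async - reads_delayed
budget_high = capacity - reads_delayed

print(f"disk capacity : {capacity:.1f} ops/sec")
print(f"measured load : {total_async} ops/sec")
print(f"write budget  : {budget_low} to {budget_high:.1f} writes/sec")
```

Since the generator still dirties 20 blocks per second, a budget below 20 random writes per second implies that some of those blocks reach the disk without costing a full random access each.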


Figures 4-1 and 4-2 show the standard deviation 
of the response times, but this hides how truly awful 
things can be with the PU policy. Figures 4-3 and 
4-4 show histograms of response time for the faster 
and slower systems, respectively. In these his-
tograms, all six trials for each configuration have
been combined, and the x-axis has been divided into 
logarithmic buckets. 


The danger of combining delayed writes and the
PU policy now shows up clearly (open circles on the
histograms). Because the PU policy puts all the dirty
delayed-write blocks on the queue at once, which in
these trials could be as high as about 600 blocks,
some reads will be delayed by up to 8 or 9 seconds.
A significant number are delayed by longer than one
second.


Description   | CPU type                  | SPECMark rating | Disk type | Average access time | Bandwidth
Faster system | DECstation 5000 model 200 | 18.5            | RZ58      | 18 msec             | 3.8-5.0 Mbyte/sec
Slower system | DECstation 3100           | (remaining entries not legible)

Table 4-1: Systems used for measurements


Figure 4-1: Local random reads, fast system

Figure 4-2: Local random reads, slow system


With the AIPU policy (filled circles), however,
only one thirtieth of the dirty blocks show up on the
disk queue at any one time. The graph in figure 4-3
shows a small peak in read-response time at about
300 msec. Since we expect 20 blocks (600/30) to be
queued once a second, this implies a mean write-
delay of about 15 msec, which corresponds closely
with the RZ58’s specified average access time of 18
msec. More important, this configuration shows a
maximum delay of 746 msec, and all but one of the
60,000 samples are below 450 msec. AIPU clearly
improves upon PU, in this experiment.
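
The same numbers also explain the multi-second delays seen with PU. The sketch below derives the per-write service time from the 300 msec peak and then applies it to a full 600-block burst. It assumes a read that arrives just after a burst must wait for the whole burst to drain, which is a simplification.

```python
def implied_service_time(peak_delay_s, blocks_per_burst):
    """Per-write time implied by a read waiting behind one whole burst."""
    return peak_delay_s / blocks_per_burst


def burst_drain_time(blocks, service_time_s):
    """Time for the disk to work through a burst of queued writes."""
    return blocks * service_time_s


dirty_blocks = 600
aipu_burst = dirty_blocks // 30                       # blocks queued each second
service = implied_service_time(0.300, aipu_burst)     # seconds per write

pu_drain = burst_drain_time(dirty_blocks, service)
aipu_drain = burst_drain_time(aipu_burst, service)

print(f"service time : {service * 1000:.0f} msec per write")
print(f"PU burst     : {pu_drain:.1f} s to drain")
print(f"AIPU burst   : {aipu_drain:.1f} s to drain")
```

The drain time for the full PU burst is consistent with the worst-case read delays of 8 or 9 seconds reported for PU with delayed writes.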


Even when I disabled delayed writes, the PU
policy still led to occasional long delays (up to five
or six seconds). In other words, AIPU performs bet-
ter than PU even if one does not want to abandon the
improved safety of using asynchronous writes.


4.2. Remote tests 

Most NFS client implementations, to ensure cache 
consistency and detection of write errors, force 
delayed writes to the server when a file is closed. 
This means that NFS clients get little advantage from 
delayed writes, and do not depend much on the up- 
date policy. More recent file service protocols, 
however, such as Sprite [9], Spritely NFS [15], and 
NQNFS [7], use explicit cache-consistency protocols 
and so can benefit from delayed writes. 


I ran a set of experiments using Spritely NFS to
access a remote disk, using the “slower” system as
the client, and the “faster” system as the server.
Both the read-load and write-load generators ran on 
the same client host. The server host supports 
PrestoServe™ non-volatile RAM (NVRAM). I ran 
six trials in each of six configurations, as shown in 
figure 4-5, using random-access reads. I also ran six 
trials in each of four configurations, as shown in 
figure 4-6, using sequential reads; in this set of trials, 
delayed writes were always used. The sequential- 
read generator cycled through the blocks of a file 
much larger than the buffer cache on either the client 
or server, so no reads were satisfied by the caches. 


These experiments showed less conclusive results 
than the local-disk experiments. For random reads 
(figure 4-5), delayed write combined with PU results 
in slightly poorer read response time than delayed 
write with AIPU, although several AIPU trials ex- 
hibited much higher variance than any other trials. 
PrestoServe also seems to be generally beneficial. 


For sequential reads (figure 4-6), AIPU seems 
mostly to reduce the variation between trials. The 
mean response time is slightly worse for AIPU than 
for PU without PrestoServe, and slightly better with 
PrestoServe. 


4.3. Kernel-build benchmark 

To see if the update policy had any effect on a 
‘‘real’’ application, I measured the time it took to 
compile and link the entire ULTRIX V4.3 kernel, un- 
der different combinations of update policy and write 
policy. These tests were all run on the “faster” sys-
tem, with all kernel source and object files on the
local disk. This process creates about 43 MB of ob- 
ject files, and a similar amount of temporary file data. 
I ran three trials in each configuration; the means of 
the results are shown in table 4-2. 


The AIPU policy shows a small but clear advan- 
tage over the PU policy, especially when using 
delayed writes. In fact, when using delayed writes 
with the PU policy, the net elapsed time is actually 
slightly worse than the normal ULTRIX configuration. 
The combination of delayed writes and AIPU is 
about 2.3% faster than the asynch write/PU combina-
tion. Note that this benchmark is rather CPU-bound;
on these trials, the CPU idle time averaged between 
12% and 14%. I would expect AIPU to show a larger 
benefit on a more I/O-bound application. 


The table also shows the number of disk writes 
charged to the processes involved in the build. The 
kernel charges a process the first time a block is 
dirtied; subsequent writes to a dirty block are not 
counted. We see that while use of delayed writes 
substantially decreases the write count, by increasing 
the chance that a write will hit a dirty block, the 
AIPU policy provides another big decrease, probably 
by increasing the average lifetime of a dirty block. 
The combination of delayed writes and AIPU 
eliminates over 35% of the disk write operations; 
AIPU by itself accounts for less than half of the im- 
provement. 
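
The ratios discussed in this section follow directly from the elapsed times and write counts in table 4-2. The sketch below recomputes them; the dictionary layout and variable names are just a convenient arrangement for the example.

```python
# (write policy, update policy) -> mean of 3 trials, from table 4-2
elapsed = {("async", "PU"): 3280, ("async", "AIPU"): 3234,
           ("delayed", "PU"): 3298, ("delayed", "AIPU"): 3206}
writes = {("async", "PU"): 40548, ("async", "AIPU"): 37854,
          ("delayed", "PU"): 31523, ("delayed", "AIPU"): 25966}

baseline = ("async", "PU")      # the normal ULTRIX configuration
best = ("delayed", "AIPU")

rel_async = elapsed[("async", "AIPU")] / elapsed[("async", "PU")]
rel_delayed = elapsed[("delayed", "AIPU")] / elapsed[("delayed", "PU")]
time_saved = 1 - elapsed[best] / elapsed[baseline]
writes_saved = 1 - writes[best] / writes[baseline]

# Share of the total write reduction that appears only once AIPU is
# added on top of delayed writes.
aipu_share = ((writes[("delayed", "PU")] - writes[best])
              / (writes[baseline] - writes[best]))

print(f"relative elapsed time, async writes  : {rel_async:.3f}")
print(f"relative elapsed time, delayed writes: {rel_delayed:.3f}")
print(f"elapsed time saved vs. baseline      : {time_saved:.1%}")
print(f"disk writes eliminated vs. baseline  : {writes_saved:.1%}")
print(f"AIPU share of the write reduction    : {aipu_share:.1%}")
```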


The table does not show the number of read I/Os 
charged, since this hardly varies at all with the update 
policy, and only a few per cent with the use of 
delayed writes. 


4.4. CPU costs of update mechanism 

Does the update policy have any effect on the 
CPU-time cost of doing the updates? The AIPU 
policy scans the list of dirty blocks 30 times more 
often than the PU policy does, so one might expect it 
to consume more CPU time. 


I measured the CPU time charged to the 
/etc/update process during the kernel-
build benchmark. This includes both user-mode time 
(which should be nearly zero) and kernel-mode time 
(which accounts for all CPU time spent in the 
bflush() routine, as well as other activity). It does not 
include kernel-mode time spent as a result of disk 
interrupts. The results are shown in table 4-3; note 
that the underlying measurements were done with 1- 
second resolution, and so small variations in the 
results are not significant. 


The table shows that, not surprisingly, aggressive 
use of delayed writes does increase the CPU time 
spent in finding delayed-write blocks and scheduling 
them for disk I/O. Contrary to my expectation, 
however, AIPU actually reduces the CPU cost of do- 
ing updates (although the total cost is in either case 
insignificant).

Figure 4-4: Local random reads, slow system (histogram of response times)


I cannot provide a definite explanation, but I
suspect that the cause may be the difference in the
number of delayed-write blocks actually written to
disk. Both algorithms scan every delayed-write
block in the buffer cache, and since the AIPU algo-
rithm does this 30 times more often, this suggests that
the cost of actually scanning blocks does not
dominate the CPU time; the time is probably spent in
the device driver.³ I ran additional trials using AIPU
with a 30-second interval between updates (but still
using a 30-second age threshold for writing back
dirty blocks). This took more CPU time than AIPU
with a 1-second interval, but still less than PU.


³ Note that for the kernel-build benchmark, PU with
delayed writes scans 16749 blocks in 6.0 CPU seconds,
placing a lower bound of 6/16.7 = 0.36 msec per block
scanned. This corresponds to several hundred instruction
executions, much more than could be accounted for by the
scanning loop itself, so most of this time must be spent in
the disk driver.


1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA


Table 4-3 shows that AIPU scans far more blocks
than PU, because PU scans each block exactly once,
but AIPU may scan a dirty block many times before
deciding that it is old enough to write to disk.
However, AIPU actually writes fewer blocks than
PU, because AIPU allows the average dirty block to
stay in the cache longer. Recall that with PU, the
average age of a block when written to disk is 15
seconds, but with AIPU the average is 30 seconds,
assuming a uniform rate of file writes. (AIPU-30,
with a 30-second threshold and a 30-second interval
between updates, yields an average age of 45
seconds, and so does slightly fewer writes than
AIPU-1.)


Figure 4-6: Remote sequential reads


               | Periodic update    | Approx. interval periodic update | Relative elapsed time
Asynch writes  | 3280 sec. elapsed, | 3234 sec. elapsed,               | 0.986
               | 40548 writes       | 37854 writes                     |
Delayed writes | 3298 sec. elapsed, | 3206 sec. elapsed,               | 0.972
               | 31523 writes       | 25966 writes                     |

Mean values for 3 trials
Table 4-2: Elapsed time on kernel-build benchmark


A dirty block that stays in the cache for a longer
time is more likely to be modified again before being
written to disk, which results in fewer modifications
of clean blocks, and hence fewer writes to disk. I
counted the number of modifications of currently
dirty blocks (in the brelse() routine); the numbers are
shown in table 4-3. AIPU with delayed writes results
in moderately more dirty-block modifications (426K
vs. 419K for PU); the difference in dirty-block
modifications is roughly the same as the difference in
delayed-write disk I/Os.


                                        | PU, async writes | PU, delayed writes | AIPU-1, async writes | AIPU-1, delayed writes | AIPU-30, async writes | AIPU-30, delayed writes
CPU time used by /etc/update            | (values not legible)
(mean of 3 trials)                      |
Blocks scanned by /etc/update           | 9531             | 16749              | 290214               | 508448                 | 14593                 | 25847
Blocks written by /etc/update           | 9531             | 16749              | 5375                 | 10023                  | 5266                  | 9519
Dirty blocks modified during benchmark  | 391K             | 419K               | 393K                 | 426K                   | 405K                  | 426K
biowait() sleep events during benchmark | 8234             | 8274               | 7607                 | 8121                   | 7496                  | 7751

Table 4-3: Statistics for kernel-build benchmark


The reduction in actual disk I/Os, caused by 
longer cache lifetimes and more dirty-block
modifications, may account for all of the elapsed-
time advantage of AIPU over PU on the kernel-build
benchmark. As table 4-3 shows, with AIPU-1 and espe-
cially AIPU-30, the kernel ‘‘sleeps’’ less often for
disk I/O than it does with PU (although I could not 
measure the total sleep time). However, this effect 
should not contribute to the random-access results in 
sections 4.1 and 4.2, since these experiments were 
constructed to avoid any cache hits on file writes. 


4.5. Burstiness of file writes 

The advantage of AIPU over PU is that the latter 
clumps together all delayed writes from a 30-second 
period into a single burst of writes, while the former 
preserves the original spacing of the file writes (with 
1-second resolution). If the file writes themselves
arrive in bursts, this eliminates AIPU’s advantage. In 
the worst case, when all file writes during a 30- 
second period occur nearly at once, both policies 
should perform the same. 
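
The contrast between the two policies can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel code: times are integer milliseconds, every file write dirties a distinct block, and the names `pu_bursts` and `aipu_bursts` are invented for the illustration.

```python
from collections import Counter

def pu_bursts(write_times, period=30000):
    """Blocks queued per update under periodic update (PU): every dirty
    block is queued at the next multiple of `period` (milliseconds)."""
    bursts = Counter()
    for t in write_times:
        bursts[(t // period + 1) * period] += 1
    return bursts

def aipu_bursts(write_times, age=30000, interval=1000):
    """Blocks queued per update under an approximate interval periodic
    update: a block is queued at the first `interval` tick at which it is
    at least `age` milliseconds old."""
    bursts = Counter()
    for t in write_times:
        due = t + age
        tick = -(-due // interval) * interval   # round up to the next tick
        bursts[tick] += 1
    return bursts

# 300 file writes spread evenly over 30 seconds (the best case for AIPU).
uniform = [i * 100 for i in range(300)]
# 300 file writes arriving within 0.3 seconds (the worst case).
bursty = [12001 + i for i in range(300)]

largest = {
    "PU, uniform": max(pu_bursts(uniform).values()),
    "AIPU, uniform": max(aipu_bursts(uniform).values()),
    "PU, bursty": max(pu_bursts(bursty).values()),
    "AIPU, bursty": max(aipu_bursts(bursty).values()),
}
```

With the uniform load, PU queues all 300 blocks in one burst while the AIPU model never queues more than 10 at a time; with the bursty load both models queue a single burst of 300 blocks.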


In the experiments reported in sections 4.1 and 
4.2, the write-load generator distributed file writes 
uniformly over time, which is the best case for AIPU. 
The kernel-build benchmark should be more repre- 
sentative of real use; how bursty is its file-write pat- 
tern? (Note that this kind of single-user benchmark is 
more likely to exhibit burstiness than a multi-user 
benchmark, because the latter will tend to spread out 
the file-system load among several jobs.) 


I modified the bflush() and bflush_smooth() 
routines to keep a histogram of the number of blocks 
they queue for the disk on each invocation. I then ran 
one trial of the kernel-build benchmark for each of
PU, AIPU-1, and AIPU-30. In all cases, the update
period (age at which a block is queued to the disk) 
was 30 seconds. 


With AIPU-1, since blocks are queued 30 seconds 
after the corresponding file write, the burstiness in 
queue-batch size directly mirrors the burstiness in the 
file-write pattern (with 1-second granularity). 

The results are shown in figure 4-7, in the form of 
cumulative distributions for the number of blocks 
written as a function of the burst size. Each policy 
writes a different number of blocks (see table 4-3), so 
the curves do not end at the same ordinate. For 
AIPU-1 (with 1 second between updates and a 30- 
second period), 95% of all delayed-write blocks writ- 
ten were queued by bflush_smooth() in bursts of 40 
blocks or fewer. The largest burst contained 104 
blocks. In other words, the application delivered 
relatively small bursts of writes to the file system. 


For PU (with 30 seconds between updates), 95% 
of the delayed-write blocks queued were in bursts of 
more than 62 blocks, and 50% were queued in bursts 
of more than 182 blocks. The largest burst contained 
344 blocks, which (assuming an 18 msec. mean ac- 
cess time) delayed any subsequent synchronous 
operation by over 6 seconds. 
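
The delay figure follows from simple arithmetic, and the cumulative curves of figure 4-7 can be derived from a burst-size histogram in the same spirit. The sketch below assumes a fixed mean access time per block and uses a made-up histogram; neither helper name appears in the kernel.

```python
def queue_delay_seconds(burst_blocks, mean_access_ms=18):
    """Time to drain a burst, assuming a fixed mean access time per block."""
    return burst_blocks * mean_access_ms / 1000.0

def blocks_cdf(histogram):
    """Cumulative number of blocks written as a function of burst size.

    `histogram` maps burst size -> number of bursts of that size, which is
    the kind of data the instrumented flush routines would collect.
    """
    total = 0
    curve = []
    for size in sorted(histogram):
        total += size * histogram[size]
        curve.append((size, total))
    return curve

largest_pu_burst = queue_delay_seconds(344)      # largest PU burst in the text
largest_aipu_burst = queue_delay_seconds(104)    # largest AIPU-1 burst in the text

# Hypothetical histogram, for illustration only (not measured data).
example_curve = blocks_cdf({10: 5, 40: 2, 104: 1})
```

Under the 18 msec. assumption, a 344-block burst occupies the disk for about 6.2 seconds, while a 104-block burst occupies it for under 2 seconds.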


To summarize figure 4-7, the kernel-build 
benchmark does indeed spread out its writes over 
periods longer than a second. This results in far 
smaller disk-queue bursts when AIPU-1 is used than 
when PU is used. 


5. Future work 

Carson and Setia proposed using a periodic up- 
date with read priority (PURP) policy. Although I 
resisted implementing PURP, because of its greater 
complexity, in the general case one cannot avoid long 
disk queues even with IPU or an approximation. For
example, if an application manages to dirty the entire
buffer cache within a second or so, thirty seconds
later an IPU policy will schedule writes for all those
blocks, and the effect will be the same as with the PU
policy. In other words, IPU depends on a relatively
uniform distribution of file writes (across time) to
achieve its more uniform distribution of disk writes.


Figure 4-7: Burst-size distributions for writes during kernel-build benchmark


If long disk queues are inevitable, and most of the 
entries on such queues are inherently asynchronous, 
then giving priority to reads and synchronous writes 
should improve response time. I suspect that, even 
with a priority scheme, an IPU policy could outper- 
form PU. Suppose the buffer cache is entirely filled 
by dirty blocks; then, a read operation must wait until 
the system cleans a block before it can complete. 
The IPU policy generates clean blocks once a second 
or so (assuming a uniform distribution of disk writes, 
across time), but the PU policy only does this every 
30 seconds. Thus, with PU, reads would be more 
likely to block waiting for a free buffer. This is 
speculation; we need experiments, simulation, or 
more formal analysis to discover the truth. 


Peacock [11, 12] has described several systems 
that use PURP, but apparently did not explore 
modified update policies. He found that adding read 
priority to System V Release 4, in which the buffer 
cache can be quite large, actually reduced benchmark 
performance by preventing asynchronous requests 
from getting a sufficient share of the disk. 

My experiments have all used a 30-second period 
for PU and for the lifetime of dirty blocks in IPU, and 
a one-second granularity for IPU. I suspect that use 
of different periods and granularities, within reason-
able limits, will not make a big difference, but this
should be the subject of additional experiments. 


Some UNIX file system implementations attempt 
to cluster several blocks together when performing a 
disk write [8,11]. That is, if the cache contains 
several dirty blocks that are adjacent on disk, the file 
system or disk driver attempts to write them all at 
once, which improves throughput by eliminating 
seeks and rotational delays. Clustering can be done 
in several different ways, and may interact with the 
delayed write policy (that is, are all data writes 
delayed, or only partially-filled blocks?) and with the 
update policy. The ULTRIX systems tested in section 
4 use a clustering algorithm; I have not done experi- 
ments to see if this affects the relative performance of 
update policies. 


6. Summary and conclusions 
The experiments described in this paper show that
• Use of delayed writes can improve overall file
  system performance, including read response
  times, on both synthetic and actual workloads,

• But when delayed writes are combined with the
  traditional periodic update policy, variance in
  read response time increases significantly, and
  benchmark performance may decrease,

• So one should use a better update policy, such
  as interval periodic update or an approximation,
  whenever one uses a delayed write policy.


I also showed that one can easily implement an
approximate interval periodic update policy, with
remarkably limited changes to the kernel of a tradi-
tional operating system.



Abstract

This paper describes the structure and performance 
characteristics of a commercial file system designed 
for use on desktop, laptop, and notebook computers 
running the UNIX operating system. Such systems 
are characterized by their small disk drives dictated 
by system size and power requirements. In addition, 
these systems are often used by people who have lit- 
tle or no experience administering Unix systems. 
The Desktop File System attempts to improve overall 
system usability by transparently compressing files, 
increasing file system reliability, and simplifying 
administrative interfaces. The Desktop File System 
has been in production use for over a year, and will 
be included in future versions of the SCO Open 
Desktop Unix system. Although originally intended 
for a desktop environment, the file system is also 
being used on many larger, server-style machines. 


1. Overview 
This paper describes a commercial file system 
designed for use on desktop, laptop, and notebook 
computers running the UNIX operating system. We 
describe design choices made and discuss some of the 
interesting ramifications of those choices. The most 
notable characteristic of the file system is its ability 
to compress and decompress files ‘‘on-the-fly.’’ We
provide performance information that proves such a 
file system is a viable option in the Unix marketplace. 
When we use the term ‘‘commercial file sys-
tem,’’ we mean to imply two things. First, the file
system is used in real life. It is not a prototype, nor is 
it a research project. Second, our design choices 
were limited to the scope of the file system. We were 
not free to rewrite portions of the base operating sys-
tem to meet our needs, with one exception (we pro-
vided our own routines to access the system buffer
cache). 


1.1 Goals 

Our goals in designing the Desktop File System 
(DTFS) were influenced by our impressions of what 
the environment was like for small computer systems, 
such as desktop and laptop computers. The physical 
size of these systems limits the size of the power sup- 
plies and hard disk drives that they can use at a rea- 
sonable cost. Systems that are powered by batteries 
attempt to use small disks to minimize the power 
drained by the disk, thus increasing the amount of 
time that the system can be used before requiring that 
the batteries be recharged. 

It is common to find disk sizes in the range of 
80 to 300 Megabytes in current 80x86-based laptop 
and notebook systems. Documentation for current
versions of UnixWare recommends a minimum of 80
MB of disk space for the personal edition, and 120 
MB for the application server. Similarly, Solaris 
documentation stipulates a minimum of 200 MB of 
disk space. These recommendations do not include 
space for additional software packages. 

We also had the impression that desktop and 
notebook computers were less likely to be admin- 
istered properly than larger systems in a general com- 
puting facility, because the primary user of the sys- 
tems will probably be performing the administrative 
procedures, often without the experience of profes- 
sional system administrators. These impressions led 
us to the following goals: 

• Reduce the amount of disk space needed by con-
  ventional file systems.

• Increase file system robustness in the presence of
  abnormal system failures.

• Minimize any performance degradation that might
  arise because of data compression.

• Simplify administrative interfaces when possible.

The most obvious way to decrease the amount 
of disk space used was to compress user data. (We
use the term ‘‘user data’’ to refer to the data read
from and written to a file, and we use the term ‘‘meta 
data’’ to refer to accounting information used inter- 
nally by the file system to represent files.) Our 
efforts did not stop there, however. We designed the 
file system to allocate disk inodes as they are needed, 
so that no space is wasted by unused inodes. In addi- 
tion, we use a variable block size for user data. This 
minimizes the amount of space wasted by partially- 
filled disk blocks. 

Our intent in increasing the robustness of the 
file system stemmed from our belief that users of 
other desktop systems (such as MS-DOS) would rou- 
tinely shut their systems down merely by powering 
off the computer, instead of using some more gradual 
method. As it turned out, our eventual choice of file 
structure required us to build robustness in anyway. 

From the outset, we realized that any file sys- 
tem that added another level of data processing would 
probably be slower than other file systems. 
Nevertheless, we believed that this would not be 
noticeable on most systems because of the disparity 
between CPU and disk I/O speeds. In fact, current 
trends indicate that this disparity is widening as CPU 
speeds increase at a faster pace than disk I/O speeds 
[KAR94]. Most systems today are bottlenecked in 
the I/O subsystem [OUS90], so the spare CPU cycles 
would be better spent compressing and decompress- 
ing data. 

Our assumptions about typical users led us to 
believe that users would not know the number of 
inodes that they needed at the time they made a file 
system, and some might not even know the size of a 
given disk partition. We therefore endeavored to 
make the interfaces to the file system administrative 
commands as simple as possible. 


1.2 Related Work 
Several attempts to integrate file compression and 
decompression into the file system have been made in 
the past. [TAU91] describes compression as applied 
to executables on a RISC-based computer with a 
large page size and small disks. While compression 
is performed by a user-level program, the kernel is 
responsible for decompression when it faults a page 
in from disk. Each memory page is compressed 
independently, with each compressed page stored on 
disk starting with a new disk block. This simplifies 
the process of creating an uncompressed in-core 
image of a page from its compressed disk image. 
DTFS borrows this technique. 

The disadvantage with this work is that it 
doesn’t fully integrate compression into the file sys-
tem. Only binary executable files are compressed,
and applications that read the executables see
compressed data. These files must be uncompressed 
before becoming intelligible to programs like 
debuggers, for example. The compression and 
decompression steps are not transparent. 

In [CAT91], compression and decompression 
are integrated into a file system. Files are 
compressed and decompressed in their entirety, with 
the disk split into two sections: one area contains 
compressed file images, and the other area is used as 
a cache for the uncompressed file images. The 
authors consider it prohibitively expensive to 
decompress all files on every access, hence the cache. 
This reduces the disk space that would have been 
available had the entire disk been used to hold only 
compressed images. 

The file system was prototyped as an NFS 
server. Files are decompressed when they are first
accessed, and thus migrate from the compressed portion
of the disk to the uncompressed portion. A daemon 
runs at off-peak hours to compress the files least- 
recently used, and move them back to the compressed 
portion of the disk. 

Transparent data compression is much more 
common with the MS-DOS operating system. Pro- 
ducts like Stacker have existed for many years. They 
are implemented as pseudo-disk drivers, intercepting 
I/O requests, and applying them to a compressed 
image of the disk [HAL94]. 

We chose not to implement our solution as a 
pseudo-driver because we felt that a file system 
implementation would integrate better into the Unix 
operating system. For example, file systems typically 
maintain counters that track the number of available 
disk blocks in a given partition. A pseudo-driver 
would either have to guess how many compressed 
blocks would fit in its disk partition or continually try 
to fool the file system as to the number of available 
disk blocks. No means of feedback is available, short 
of the pseudo-driver modifying the file system’s data 
structures. If the pseudo-driver were to make a guess 
at the time a file system is created, then things would 
get sticky if files didn’t compress as well as expected. 

In addition, a file system implementation gave 
us the opportunity to employ transaction processing 
techniques to increase the robustness of the file sys- 
tem in the presence of abnormal system failures. A 
pseudo-driver would have no knowledge of a given 
file system’s data structures, and thus would have no 
way of associating a disk block address with any 
other related disk blocks. A file system, on the other 
hand, could employ its knowledge about the interrela- 
tionships of disk blocks to ensure that files are always 
kept in a consistent state.


Figure 1. File System Layout


Much more work has gone into increasing file 
system robustness than has gone into integrating 
compression into file systems. Episode [CHU92], 
WAFL [HIT94], and the Log-structured File System 
[SEL92] all use copy-on-write techniques, also 
known as shadow paging, to keep files consistent. 
Copies of modified disk blocks are not associated 
with files until the associated transaction is commit- 
ted. (Shadow paging is discussed in Section 5.1.) 


2. File System Layout 

Figure 1 shows the file system layout of DTFS. The 
super block contains global information about the file 
system, such as the size, number of free blocks, file 
system state, etc. The block bitmap records the status 
(allocated or free) of each 512-byte disk block. Simi- 
larly, the inode bitmap, which is the same size as the 
block bitmap, identifies the disk blocks that are used 
for inodes. Finally, the rest of the disk partition is 
available for use as user data and meta data. 

The only critical information kept in the super 
block is the size of the file system. The rest of the 
information can be reconstructed by fsck. Simi- 
larly, the block bitmap can be reconstructed by 
fsck, should it be corrupted by a bad disk block. 
The inode bitmap is more critical, however, as fsck 
uses it as a guide to identify where the inodes are 
stored in the file system (inode placement is dis- 
cussed in the next section). If the inode bitmap is 
corrupted, files will be lost, so users ultimately have 
to rely on backups to restore their files. 


2.1 Inode Placement 

As stated previously, one way DTFS saves space is 
by not preallocating inodes. Any disk block is fair 
game to be allocated as an inode. This has several 
interesting repercussions. 

First, there is no need for an administrator to 
guess how many inodes are needed when making a 
file system. The number of inodes is only limited by 
the physical size of the disk partition and the size of 
the identifiers used to represent inode numbers. This 
simplifies the interface to mkfs. 

Second, the inode allocation policy renders the 
NFS generation count mechanism [SAN85] ineffec-
tive. In short, when a file is removed and its link
count goes to 0, an integer, called the generation
count, in the disk inode is incremented. This genera- 
tion count is part of the file handle used to represent 
the file on client machines. Thus, when a file is 
removed, active handles for the file are made invalid, 
because newly-generated file handles for the inode 
will no longer match those held by clients. 

With DTFS, the inode block can be reallocated 
as user data, freed, and then reallocated as an inode 
again. Thus, generation counts cannot be retained on 
disk. We solve this problem by replacing the genera- 
tion count with the file creation time concatenated 
with a per-file-system rotor that is incremented every 
time an inode is allocated. 
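
A minimal sketch of this substitute for the generation count follows. The field widths (32 bits each) and the class and method names are assumptions made for illustration; the text does not specify them.

```python
class ToyFileSystem:
    """Holds the per-file-system rotor described above (width assumed)."""

    def __init__(self):
        self.rotor = 0

    def allocate_inode(self, creation_time):
        """Return the value used in place of an NFS generation count.

        The rotor is incremented on every inode allocation, so two inodes
        created in the same second still receive different values.
        """
        self.rotor = (self.rotor + 1) & 0xFFFFFFFF
        # Creation time in the high half, rotor in the low half.
        return ((creation_time & 0xFFFFFFFF) << 32) | self.rotor

fs = ToyFileSystem()
first = fs.allocate_inode(creation_time=770000000)
second = fs.allocate_inode(creation_time=770000000)   # same second, new inode
```

Because the value is computed at allocation time rather than read back from the disk block, it does not matter that the block may have held user data in between.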


2.2 File Structure 

A DTFS file is stored as a B*tree [COM79], with the 
inode acting as the root of the tree. Two factors led 
us to this design. The first of these is the need to con- 
vert between logical file offsets and physical data 
offsets in a file. For example, if an application were 
to open a 100000-byte file and seek to byte position
73921, we would need an efficient way to translate 
this to the disk block containing that data. Since the 
data are compressed, we cannot simply divide by the 
logical block size to determine the logical block 
number of the disk block containing the data. 

A simple solution would be to start at the 
beginning of the file and decompress the data until 
the requested offset is reached. This, however, would 
waste too much time. To make the search more effi- 
cient, we use logical file offsets as the keys in the 
B*tree nodes. The search is bounded by the height of 
the tree, so it doesn’t cost any more to find byte
n+100000 than it costs to find byte n. 

The second factor is a phenomenon we refer to 
as spillover. Consider what happens when a file is 
overwritten. The new data might not compress as 
well as the existing data. In this case, we might have 
to allocate new disk blocks and insert them into the 
tree. With the B*tree structure, we can provide an 
efficient implementation for performing insertions 
and deletions. 

When a file is first created, it consists of only 
an inode. Data written to the file are compressed, 
stored in disk blocks, and the disk block addresses
and lengths are stored in the inode (see Figure 2).
DTFS uses physical addresses instead of logical ones
to refer to disk blocks, because a ‘‘block’’ in DTFS
can be any size from 512 bytes to 4096 bytes, in mul-
tiples of 512 bytes. In other words, a disk block is
identified by its starting sector number relative to the
beginning of the disk partition.


Figure 2. A One-level File


Figure 3. A Two-level File


When the inode’s disk address slots are all 
used, the next block to be added to the file will cause 
the B*tree to grow one level. (See Figure 3; note that 
most links between nodes are omitted for clarity). 
Two disk blocks are allocated to be used as leaf 
nodes for the tree. The data blocks are divided 
between the two leaf nodes, and the inode is modified 
to refer to the two leaf nodes instead of the data 
blocks. In this way, as the tree grows, all data blocks 
are kept at the leaves of the B*tree.

Figure 4 shows what would happen if the
B*tree were to grow an additional level. The inode
now refers to interior nodes of the B*tree. Interior
nodes either refer to leaf nodes or to other interior 
nodes. Each entry in an interior node refers to a sub- 
tree of the file. The key in each entry is the max- 
imum file offset represented by the subtree. The keys 
are used to navigate the B*tree when searching for a 
page at a particular offset. The search time is 
bounded by the height of the B*tree. 
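
The search described above can be sketched as follows. This is a simplified illustration, assuming each entry's key is the largest logical offset covered by that entry and using the page's first logical offset as a stand-in for the real block descriptor; the `Node` class, the fan-out of eight, and the helper names are hypothetical.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children, leaf):
        self.keys = keys          # largest logical offset covered by each entry
        self.children = children  # child Nodes, or descriptors when leaf is True
        self.leaf = leaf

def lookup(root, offset):
    """Return (descriptor, levels_visited) for the page holding `offset`."""
    node, levels = root, 1
    while True:
        i = bisect_left(node.keys, offset)   # first entry whose key >= offset
        if i == len(node.keys):
            raise KeyError("offset past end of file")
        if node.leaf:
            return node.children[i], levels
        node, levels = node.children[i], levels + 1

# Build a two-level tree for a 100000-byte file with 4096-byte pages.
PAGE, SIZE = 4096, 100000
pages = [(first, min(first + PAGE, SIZE) - 1) for first in range(0, SIZE, PAGE)]
leaves = []
for i in range(0, len(pages), 8):            # eight entries per leaf (assumed)
    group = pages[i:i + 8]
    leaves.append(Node([last for _, last in group],
                       [first for first, _ in group], leaf=True))
root = Node([leaf.keys[-1] for leaf in leaves], leaves, leaf=False)

found = lookup(root, 73921)                  # the seek example from the text
```

In this model the seek to byte 73921 reaches the page that starts at logical offset 73728 after visiting two nodes, and any other offset in the file costs the same two visits.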


3. Compression 

DTFS is implemented within the Vnodes architecture 
[KLE86]. DTFS compresses the data stored in regu- 
lar files to reduce disk space requirements (directories 
remain uncompressed). Compression is performed a 
page at a time, and is triggered during the vnode 
putpage operation. Similarly, decompression 
occurs during the getpage operation. Thus, 
compression and decompression occur ‘‘on-the-fly.’’

The choice of compressing individual pages 
limits the overall effectiveness of adaptive compres- 
sion algorithms that require a lot of data before their 
tables are built. Nonetheless, some algorithms do 
quite well with only a page of data. This was the 
natural design choice for DTFS, since it was origi- 
nally designed for UNIX System V Release 4 
(SVR4), an operating system whose fundamental vir- 
tual memory abstraction is the page. (On an Intel 
80x86 processor, SVR4 uses a 4KB page size.) 

Each page of compressed data is represented on 
disk by a disk block descriptor (DBD). The DBD 
contains information such as the logical file offset 
represented by the data, the amount of compressed 
data stored in the disk block, and the amount of 
uncompressed data represented. The DBDs exist 
only in the leaf nodes of the B*tree. 

Because each page is compressed individually, 
DTFS is best suited to decompression algorithms that 
build their translation tables from the compressed 
data, thus requiring no additional on-disk storage. 
The class of algorithms originated by Lempel and Ziv
[ZIV77], [ZIV78] is typical of such algorithms.

The original version of DTFS only supported 
two compression algorithms (an LZW derivative 
[WEL84] and ‘‘no compression’’). The latest version 
of DTFS allows for multiple compression algorithms 
to be used at the same time. We designed an applica- 
tion programming interface so that customers can add 
their own algorithms if they should so desire. 


4. Block Allocation 

With the exception of the super block and bitmaps 
(recall Figure 1), every disk block in a DTFS file sys- 
tem can be allocated for use as an inode, other meta 
data, or user data. The basic allocation mechanism is
simple: the file system keeps track of the first
available block in the partition and allocates blocks
starting from there. Requests to allocate multiple
blocks search the file system for a free contiguous
region at least as large as the number of blocks
requested. If no region is found to be large enough,
allocation switches to a first-fit algorithm that returns
as many blocks as are contiguously available, starting
with the first free block.


Figure 4. A Three-level File
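
The basic allocation mechanism can be sketched as follows. This is an illustration only: the bitmap is modelled as a Python list of booleans, the first-free index is passed in by the caller, and the function names are invented.

```python
import errno
import os

def find_run(bitmap, start, count):
    """Return the index of the first run of `count` free blocks at or after
    `start`, or None if there is none. bitmap[i] is True when block i is in use."""
    run_len = 0
    for i in range(start, len(bitmap)):
        run_len = 0 if bitmap[i] else run_len + 1
        if run_len == count:
            return i - count + 1
    return None

def allocate(bitmap, first_free, count):
    """Allocate up to `count` contiguous blocks; return (start, length)."""
    start = find_run(bitmap, first_free, count)
    length = count
    if start is None:
        # First-fit fallback: take the free blocks contiguous with the
        # first free block, even though there are fewer than requested.
        start = find_run(bitmap, first_free, 1)
        if start is None:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        length = 0
        while start + length < len(bitmap) and not bitmap[start + length]:
            length += 1
    for i in range(start, start + length):
        bitmap[i] = True
    return start, length

bitmap = [True, False, False, True, False, False, False, False, True]
first_try = allocate(bitmap, 1, 3)    # a run of three exists at block 4
second_try = allocate(bitmap, 1, 3)   # no run of three left: first-fit
```

The second request returns only two blocks, starting at the first free block, so the caller must be prepared to issue further requests for the remainder.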

Unmodified, this allocation mechanism would 
cause severe file system fragmentation for several 
reasons. First, disk blocks backing a page must be 
allocated at the time a file is written, so that errors 
like ENOSPC can be returned to the application. 
Usually pages are written to disk sometime later, as 
the result of page reuse or file system hardening. If 
DTFS were to defer block allocation until it wrote a 
page to disk, then DTFS would be unable to inform 
the application of a failure to allocate a disk block. 

With dynamic data compression, after each 
page is compressed, the leftover blocks are freed back 
to the file system. So for every eight-block page 
there might be a run of, say, four blocks allocated fol- 
lowed by four blocks freed. As each page requires 
eight disk blocks before compression, no other page 
would use any of the four-free-block regions until the 
entire file system is fragmented and allocation 
switches to first-fit.

The second contributor to fragmentation is the
shadow paging algorithm, which requires that every 
overwritten page have new blocks allocated before 
the old blocks are freed. The new blocks would be 
allocated from some free region, most likely in a dif- 
ferent place in the file system. 

The final reason for fragmentation is that any 
block can be used for any purpose, so inode blocks 
and user data blocks could be intermingled. This is 
partly mitigated by the fact that inode and meta data 
blocks can use the few-block regions between 
compressed user data blocks. 

We use several techniques to avoid fragmenta- 
tion. First, user data blocks are allocated in clusters, 
one cluster per inode. When the DTFS write func- 
tion attempts to allocate disk space for a user data 
page, a much larger chunk of space (typically 32 
blocks) is actually allocated from the general disk 
free space pool, and the blocks for the page in ques- 
tion are allocated from this cluster. The remainder of 
the cluster is saved in the in-core inode for use by 
later allocations. When the inode is no longer in use, 
the unused blocks in the cluster are returned to the 
file system. This has the effect of localizing disk 
space belonging to a particular file. 

Second, after compression, user data blocks are 
reassigned from within the cluster to eliminate any 
local fragmentation. For example, if the first page of
a file allocated blocks 20-27 and the second page
allocated blocks 28-35, and after compression each
page needed the first four blocks of its respective
eight-block region, the second page would be reas-
signed to use blocks 24-27, eliminating the four-
block fragment of free space.
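
The reassignment amounts to packing the compressed pages back to back within the cluster. A minimal sketch, with a hypothetical function name and the block counts passed in as a list:

```python
def compact_cluster(cluster_start, blocks_needed):
    """Assign each page its final run of blocks, packed back to back.

    `blocks_needed` lists the number of 512-byte blocks each page needs
    after compression, in file order. Returns one (first, last) pair of
    block numbers per page.
    """
    placements = []
    next_block = cluster_start
    for needed in blocks_needed:
        placements.append((next_block, next_block + needed - 1))
        next_block += needed
    return placements

# The example from the text: two pages, each needing four blocks.
placements = compact_cluster(20, [4, 4])
```

For the example above, the first page keeps blocks 20-23 and the second page moves to blocks 24-27, leaving blocks 28-35 free as one contiguous region.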

Finally, we keep two first-free block indicators, 
and allocate inode blocks starting from one point and 
user data blocks from another. 


5. Reliability 

Our goal of increasing file system reliability was ori-
ginally based on our belief that naive users might not
shut a system down properly. File systems like S5
(the traditional System V file system [BAC86]) and
UFS (the System V equivalent of the Berkeley Fast
File System [MCK84]) can leave a file with cor-
rupted data in it if the system halts abnormally.
Unless an application is using synchronous I/O, when 
an application writes to a file such that a new disk 
block needs to be allocated, file systems usually write 
both the user data and the meta data in a delayed 
manner, caching them in mainstore and writing them 
out to disk later, to improve performance. 

The problem with this approach is that an 
abnormal system halt can leave a file containing gar- 
bage. Consider what happens when the inode’s meta 
data are written to disk, but the system halts abnor- 
mally before the user data are written. (This case is 
more likely to occur than one might think — the sys- 
tem tries to flush ‘‘dirty’’ pages and buffers to disk 
over a tunable time period, usually around 60 
seconds. Thus, in the worst case, it is possible for 60 
seconds to elapse before user data are flushed to 
disk.) In this case, the inode’s meta data will reflect 
that a newly allocated disk block contains user data, 
but the actual contents of the disk block have not 
been written yet, so the file will end up containing 
random user data. The contents of the disk block can 
be as innocuous as a block of zeros, or as harmful as 
a block that came from another user’s deleted file, 
thus presenting a security hole. (When a block is 
freed or reused for user data, its contents are usually 
not cleared.) 

This reliability problem can be solved by care- 
fully ordering the write operations. If the user data 
are forced out to disk before the inode’s meta data, 
then the file will never refer to uninitialized disk 
blocks. The ordering must be implemented carefully, 
because forcing the user data out to disk before the 
meta data can hurt performance. On the other hand, 
delaying the update of the meta data until the user 
data have found their way to disk can cause changes 
to the inode to be lost entirely if the system crashes. 


Our choice of compressing user data, as it 
turned out, forced us to increase reliability to keep the 
user data and meta data in sync. Because a disk 
block can contain a variable amount of data, and 
because that data, once compressed, can represent an 
amount of data greater than the size of the disk block, 
we require that the meta data and the user data be in 
sync at all times. The meta data describe the 
compression characteristics, and if the information 
were to be faulty, the decompression algorithm might 
react in unpredictable ways, possibly leading to data 
corruption or a system panic. 


5.1 Shadow Paging 

To solve the problem of keeping a file’s meta data 
and user data in sync, we decided to use shadow pag- 
ing [EPP90], [GRA81], with the intent of providing 
users with a consistent view of their files at all times. 

Shadow paging is a technique that can be used 
to provide atomicity in transaction processing sys- 
tems. A transaction is defined to be a unit of work 
with the following ACID characteristics [GRA93]:
Atomicity, Consistency, Isolation, and Durability. 

The atomicity property guarantees that transac- 
tions either successfully complete or have no effect. 
When a transaction is interrupted because of a failure, 
any partial work done up to that point is undone, 
causing the state to appear as if the transaction had 
never occurred. The consistency property ensures 
that the transaction results in a valid state transition. 

The isolation property (also called serializabil- 
ity) guarantees that concurrent transactions do not see 
inconsistencies resulting from the possibly multiple 
steps that make up a single transaction. (A single 
transaction is usually made up of multiple steps.) 
DTFS uses inode locks to provide isolation. 

The durability property (also known as per- 
manence) guarantees that changes made by commit- 
ted transactions are not lost because of hardware 
errors. This is usually implemented through tech- 
niques, such as disk mirroring, that employ redundant 
hardware. DTFS does not provide permanence. 

Shadow paging provides only the atomicity 
property. It works in the following way: when a page 
is about to be modified, extra blocks are allocated to 
shadow it. When the page is written out, it is actually 
written to the shadow blocks, leaving the original 
blocks unmodified. File meta data are modified in a 
similar manner, except that the node and its shadow 
share the same disk block. 

When an inode is updated, the user data are 
flushed to disk. Then the modified meta data are 
written, followed by a synchronous write of the inode 
(the root of the B*tree). Finally, all the original data 
blocks are freed. If an abnormal system halt occurs 
before the inode is written to disk, then the file 
appears as it was at the time of the previous update. 

Figure 5. Partial B*tree 

The shadow of a node is stored in the same 
disk block as the original node to avoid a ripple 
effect when modifying part of a file. In Figure 5, 
node g is shadowed by node g’. If g’ lived in a 
different disk block, adding it to the tree would 
force us to allocate new nodes, c’ and f’, to hold the 
new pointer to g’. Then we would have to update all 
the nodes that have pointers to c and f, and so on, 
until the entire tree has been updated. 

We avoid all this extra work by design: g and 
g’ reside in the same disk block (see Figure 6). This 
means that when we decide to use g’ instead of g, the 
nodes that point to g don’t have to be updated. The 
way we determine whether to use g or g’ is by storing 
a timestamp in each node to describe which is the 
most recently modified of the pair. 

Every time we write a node to disk, the inode’s 
timestamp is incremented and copied to the node’s 
timestamp field. Before the inode is written to disk, 
its timestamp is incremented. Any nodes with a 
timestamp greater than that of the inode were written 
to disk without the inode being updated, so they are 
ignored. Otherwise, we choose the node with the 
larger timestamp of the two. In the event of a system 
failure, fsck will rebuild the block bitmap, and 
the user data duplicated by shadowing will be freed. 
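
The selection rule described above can be illustrated with a small sketch. It is not the DTFS code: the `Node` type and the `pick_node` function are hypothetical names, and the sketch assumes that timestamps are plain integers and that the inode timestamp passed in is the one last committed to disk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str        # label used only for illustration, e.g. "g" or "g'"
    timestamp: int   # copied from the inode when the node was written


def pick_node(inode_timestamp: int, original: Node, shadow: Node) -> Optional[Node]:
    """Choose between a node and its shadow sharing one disk block."""
    # A copy stamped later than the inode was written to disk without the
    # inode being updated, so it is ignored.
    valid = [n for n in (original, shadow) if n.timestamp <= inode_timestamp]
    if not valid:
        return None
    # Otherwise the more recently written of the pair wins.
    return max(valid, key=lambda n: n.timestamp)
```

With an on-disk inode timestamp of 10, a pair stamped 7 and 12 resolves to the copy stamped 7; once the inode has been committed with timestamp 12 or later, the copy stamped 12 is chosen instead.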


5.2 Quiescence, Sync-on-Close, and the 
Update Daemon 

To increase reliability, we implemented a kernel process, 
called the update daemon, that checkpoints 
every writable DTFS file system once per second. 
The daemon performs a sync operation on each file 
system, with the enhancement that if the sync completes 
successfully without any other process writing 
to the file system during that time, the file system 
state is set to ‘‘clean’’ and the super block is written 
out to disk. 
Figure 6. Meta Data Shadows 


From this point until the next time any process 
writes to the file system, a system crash will not 
require that fsck be run on that file system. Studies 
of our main development machine, with DTFS on all 
user data partitions, showed that the file systems were 
in the clean state more than 98% of the time. (DTFS 
supports an ioctl that reports the file system state.) 
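
The checkpoint rule can be sketched as follows. This is a simplified model rather than the kernel implementation: the `FileSystem` class, its write counter, and the `writes_during_sync` parameter (which simulates other processes writing while the sync is in progress) are all assumptions made for illustration.

```python
class FileSystem:
    """Toy model of one writable file system (hypothetical structure)."""

    def __init__(self) -> None:
        self.state = "dirty"
        self.write_count = 0        # bumped by every write to the file system
        self.superblock_writes = 0  # how often the super block was written out

    def write(self) -> None:
        self.write_count += 1
        self.state = "dirty"

    def sync(self) -> bool:
        # Stand-in for flushing all modified data; always succeeds here.
        return True


def checkpoint(fs: FileSystem, writes_during_sync: int = 0) -> str:
    """One pass of the update daemon over a single file system."""
    before = fs.write_count
    ok = fs.sync()
    for _ in range(writes_during_sync):  # simulated concurrent writers
        fs.write()
    # Mark the file system clean only if the sync succeeded and nobody
    # wrote to the file system while it was running.
    if ok and fs.write_count == before:
        fs.state = "clean"
        fs.superblock_writes += 1
    return fs.state
```

In this model a quiet file system becomes clean after one pass, and any later write returns it to the dirty state until the next successful checkpoint.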

Another reliability enhancement was to write 
every file to disk synchronously, once the last refer- 
ence to the file was released. This mimics the 
behavior under MS-DOS that once a user’s applica- 
tion has exited, the machine can be turned off without 
data loss. This turned out to cause significant perfor- 
mance degradation, so instead we assigned this task 
to the update daemon. As the update daemon runs 
once every second, there can be at most a one-second 
delay after a user exits his or her applications until 
the machine is safe to power off. The original sync- 
on-close semantics are still available as a mount 
option for even greater reliability. 

At first glance, checkpointing each writable file 
system every second appears to be an expensive proposition, 
but in practice it turns out to be relatively 
minor. Our studies have shown that the performance 
degradation in throughput benchmarks such as the 
SPEC consortium’s SDET benchmark is on the order 
of ten percent when compared to running with the 
update daemon disabled. In addition, the update daemon 
only runs when there are recent file system 
modifications to be checkpointed, so if there is no 
DTFS activity, the update daemon overhead is zero. 

Figure 7. 


6. Performance 

We compared DTFS to other commercial file systems 
using the SPEC consortium’s SDET benchmark. It 
simulates a software development environment, 
measuring overall system throughput in terms of the 
total amount of work that can be performed in a given 
unit of time. We think SDET is a good benchmark to 
measure file system performance because it models 
real-world system usage, which just happens to exer- 
cise the file system quite rigorously. 

Our tests were run on an EISA-based 66MHz 
80486 machine with an Adaptec 1742 SCSI disk con- 
troller in enhanced mode, a 5400 RPM disk with an 
average access time of 9.5 ms, and 32 MB of 
memory. We used UnixWare version 1.1 as the host 
operating system. Figure 7 summarizes the results of 
the benchmark using the UFS file system (with the 
file system parameters optimized to match the disk 
geometry and speed), the HTFS file system (one of 
our commercial file systems, a UFS-compatible 
high-throughput file system using intent logging), the 
VxFS file system (the default UnixWare file system), 
and the DTFS file system. 

DTFS did not perform as well as the other file 
systems, but its performance was still respectable. At 
2 scripts, it performs almost as well as VxFS, and at 5 
scripts it reaches 85% of the UFS throughput. At the 
peak, it attained 74% of the throughput of UFS, 83% 
of the throughput of VxFS, but only 43% of the 
throughput of HTFS. Originally we thought the per- 
formance loss was from the costs associated with 
compressing and decompressing user data. When we 
ran the benchmark with compression disabled, we 
were surprised to find that performance was slightly 
worse than with compression enabled. We attribute 
this to the additional I/O required when files are not 
compressed, and conclude that the CPU has enough 
bandwidth to compress and decompress files on-the- 
fly. The bottleneck in DTFS is the disk layout. We 
can overcome this by reorganizing our disk layout to 
perform fewer, but larger, I/O requests (a version 
number in the DTFS super block allows us to modify 
the layout of the file system and still support existing 
formats). 

These conclusions are supported by the system 
timing results summarized in Table 1. The times are 
expressed in elapsed seconds to facilitate comparison. 
Note that with all of the file systems, the benchmark 
spent about the same time at user level (ignoring 
rounding errors), which is what we would expect for 
identical workloads. DTFS spent the most time in 
the kernel, probably because of the compression and 
decompression. DTFS also had the most wait-I/O 
Table 1. SDET Times in Seconds (4 Scripts) 


We originally intended to illustrate the effect of 
the disparity between CPU and I/O speeds by repeat- 
ing the benchmark with a slower disk drive, expect- 
ing the performance of DTFS to be closer to that of 
the other file systems. With the current disk layout it 
is not possible to meet this expectation, because the 
wait-I/O time for DTFS dominates the system time. 
Nonetheless, we are confident that once we reorgan- 
ize the disk layout, we will be more able to support 
our belief that the overhead of performing data 
compression in the file system will become negligible 
as CPU speeds increase at a faster pace than I/O 
speeds. 

Since the SDET benchmark measures overall 
system throughput, the effects of any one component, 
such as the file system, are diminished compared to 
benchmarks that isolate that component. For the pur- 
poses of comparison, we ran isolated file system 
throughput tests on the original hardware configura- 
tion. Our read test opens a file, optionally invalidates 
any pages from that file cached in-core, and reads the 
file. The read size and the total amount of data to be 
read are configurable. Similarly, our write test 
creates a file, and writes a data pattern to it in the 
increment specified, until the file reaches the 
requested size. Then the test optionally syncs the 
data to disk and invalidates the pages in-core. 

We used a file size of 1 MB and two different 
I/O sizes to measure the raw throughput through each 
file system. The results are shown in Table 2. The 
first DTFS entry was derived using a pattern that only 
compresses by 25%. For comparison, the second 
DTFS entry shows the I/O throughput using a pattern 
that compresses by 85%. 

Table 2. Raw Throughput in KB/s 

For small writes, DTFS is between 2.0 and 4.0 
times faster than VxFS, but is between 2.6 and 4.6 
times slower than HTFS and UFS. For large writes, 
DTFS is between 1.1 and 1.8 times slower 
than VxFS, and is between 2.5 and 4.0 times slower 
than HTFS and UFS. For reads, DTFS is anywhere 
from 2.3 to 7.0 times slower than the other file systems, 
but this is hidden in a system-level benchmark 
where the page cache is active. Table 3 illustrates 
that when the page cache is used, DTFS is as fast as 
the other file systems (or, more correctly, the DTFS 
read vnode operation is as efficient as HTFS and 
UFS, and better than VxFS, since the reads are satisfied 
entirely by the page cache). This is one reason 
why system-level benchmarks like SDET show reasonable 
performance for DTFS (in other words, the 
page cache works). 

Table 4 summarizes the effectiveness of the 
DTFS compression on different classes of files. The 
comparison is against a UFS file system with a block 
size of 4KB and a fragment size of 1KB. DTFS 
doesn’t compress symbolic links, but the average size 
of a symbolic link in DTFS is still less than the average 
size of a symbolic link in a UFS file system. The 50% reduction 
in size is because a symbolic link requires one 
fragment (2 disk blocks) in UFS, and in DTFS it only 
requires 1 to 3 disk blocks, depending on the length 
of the pathname (short pathnames are stored directly 
in the inode, and most symbolic links on our system 
happen to be short). 

When DTFS reports the number of disk blocks 
needed to represent a file, it adds one for the disk 
block containing the file’s inode. Other file systems 
store multiple inodes per disk block, so they cannot attribute 
the space to each file as easily. This tends to 
make comparisons of small files appear as if DTFS 
cannot compress them, and some even appear as if 
they require more disk space with DTFS. This is 
misleading when you consider that DTFS doesn’t 
have to waste disk space preallocating unused inodes. 

Table 4. Sample Compression Statistics 

For example, on a freshly-made 100000-block 
file system, UFS uses 6.0% of the disk for preallo- 
cated meta data. In comparison, VxFS uses about 
7.2% of the disk, but DTFS only uses 0.052% of the 
disk for preallocated meta data. 

The bar chart in Figure 8 illustrates the effec- 
tiveness of using DTFS as the default file system in a 
freshly-installed UnixWare system. For the other file 
systems, the default block sizes were used. 
Figure 8. Sample Disk Usage 


7. Administrative Interfaces 
During the design of DTFS, we simplified several 
administrative interfaces, mainly because we believed 
that typical desktop and laptop users are not 
interested in becoming full-time system administra- 
tors, and even if they were, they probably didn’t have 
the necessary skills. Thus, we modified several 
administrative commands to make them easier to use. 
Our fsck doesn’t prompt the user for direc- 
tions on how to proceed. The DTFS fsck simply 
tries to fix anything it finds wrong. In addition, no 
lost+found directory is needed with a DTFS file 
system. Each DTFS inode contains a one-element 
cache of the last file name associated with the inode, 
along with the inode number for that file’s parent 
directory. When fsck finds that an inode has no 
associated directory entry, fsck creates the entry 
where it last existed. This frees users from having to 
use the inode number and file contents to figure out 
which files have been reconnected by fsck. 
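
The reconnection step can be sketched as follows. The data layout is hypothetical: inodes and directories are modeled as dictionaries, `Inode` holds only the cached name and parent inode number described above, and the handling of a missing parent or a name collision (skipping the inode) is an assumption, since the text does not say what fsck does in those cases.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Inode:
    last_name: str    # one-element cache of the last file name
    parent_ino: int   # inode number of that file's parent directory


def reconnect_orphans(inodes: Dict[int, Inode],
                      directories: Dict[int, Dict[str, int]]) -> List[int]:
    """Re-create directory entries for inodes that no directory references."""
    referenced = {ino for entries in directories.values()
                  for ino in entries.values()}
    reconnected = []
    for ino, inode in sorted(inodes.items()):
        if ino in referenced:
            continue
        entries = directories.get(inode.parent_ino)
        # Assumption: skip the inode if its parent is gone or the name is
        # already taken; the source does not describe these cases.
        if entries is None or inode.last_name in entries:
            continue
        entries[inode.last_name] = ino   # put the entry back where it last existed
        reconnected.append(ino)
    return reconnected
```

Because the entry reappears under its old name in its old directory, no lost+found directory is involved in this model.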

The DTFS mkfs command will automatically 
determine the size of a disk partition, freeing users 
from having to provide this information. An option 
exists to allow users to force a file system to be 
smaller than the disk partition containing it. 

Our mkfs has no facility to allow users to con- 
figure the number of inodes for a particular file sys- 
tem. DTFS allocates inodes as they are needed, and 
any disk block can be allocated as an inode, so the 
only bound on the number of inodes for a given file 
system is the number of available disk blocks. 

The DTFS fsdb command is much easier to 
use than the fsdb found with the S5 file system. 
(We don’t expect most users to need fsdb, much 
less have the skills to use it, but some customers have 
requested that we provide it anyway.) The DTFS 
fsdb presents the user with a prompt, which by itself 
is a major improvement. The commands are less 
cryptic than their S5 counterparts. For example, the 
super block can be displayed in a human-readable 
form by typing p sb. To perform the equivalent 
operation with the S5 fsdb, one would have to type 
512B.p0o0, and then interpret the octal numbers. 
The DTFS fsdb command also has facilities to log 
its input and output to a file and to print a command 
summary. 


8. Lessons Learned 

Although DTFS was designed for use on small computer 
systems, it turned out that many customers (including 
our own development staff) use DTFS to store large 
source file archives. Although DTFS doesn’t perform 
as well as other file systems, its performance is 
good enough that most users do not notice the 
difference. 


8.1 Compatibility 

One interesting problem resulted from our original 
inode allocation policy. For simplicity, a file’s inode 
number is the block number of the disk block con- 
taining that file’s inode. (An exception is made for 
the file system’s root inode, which is hard-coded as 2 
for historical reasons.) Since inodes can be allocated 
anywhere on disk, we were reaching the point where 
inode numbers greater than 65535 were being allo- 
cated. This can break older (SVR3) binary applica- 
tions that use system interfaces that represent inode 
numbers as short integers, so we changed the inode 
allocation policy to favor lower-numbered 
disk blocks. This didn’t solve the 
problem entirely, but it made it less likely to occur. 
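
The failure mode can be shown with a short sketch. It assumes that a "short integer" here means a 16-bit unsigned field and that an oversized inode number is silently truncated to its low 16 bits; the function names are hypothetical.

```python
SHORT_INO_MAX = 0xFFFF  # largest value a 16-bit unsigned field can hold


def fits_in_short(ino: int) -> bool:
    """True if an inode number survives storage in a 16-bit field."""
    return 0 <= ino <= SHORT_INO_MAX


def as_short_inode(ino: int) -> int:
    """Value an old binary would see if the number were truncated to 16 bits."""
    return ino & SHORT_INO_MAX
```

Under these assumptions, an inode stored in disk block 65538 would appear to an old binary as inode 2, the same number as the root inode, which is why keeping inodes in low-numbered blocks reduces the risk.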

Except for ‘‘.’’ and ‘‘..’’, which are implicitly 
created when a directory is made, DTFS does not 
allow the creation of hard links to directories. We 
consider this to be an outdated feature since the 
advent of mkdir(2), rmdir(2), and symbolic 
links. In addition, it can wreak havoc with the file 
system hierarchy, and complicate file system imple- 
mentations. We have found no compatibility prob- 
lems with this policy. 


8.2 System V Kernel Limitations 
As stated in the overview, we found it necessary to 
provide our own routines to access the system’s 
buffer cache. The System V buffer cache interfaces 
that are used by file systems accept logical block 
numbers to identify disk blocks. The buffer cache 
routines then apply the formula 
    physical block number = logical block number × block size 

to convert the logical block number into a physical 
block number, which is then associated with the disk 
buffer. This makes the routines useless to file sys- 
tems that have a variable block size and already 
represent disk blocks by their physical block 
numbers. Thus, we found it necessary to provide rou- 
tines that did not apply this formula. 
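
A small sketch shows why the conversion gets in the way. The function names are hypothetical, and the sketch assumes that the block size in the formula is expressed as the number of physical blocks per logical block.

```python
def system_v_physical_block(logical_block: int, block_size: int) -> int:
    """Conversion applied by the stock buffer cache routines."""
    return logical_block * block_size


def passthrough_physical_block(physical_block: int) -> int:
    """What a variable-block-size file system needs: no conversion at all."""
    return physical_block
```

If a file system already identifies an extent by physical block number 100 and hands that number to the stock interface with a block size of 2, the buffer would be associated with block 200 rather than block 100.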

Another deficiency we found with the System 
V buffer cache was the lack of a way to invalidate an 
individual buffer in the cache. A routine existed to 
invalidate an entire device, but no such routine 
existed that would invalidate a single disk buffer. 
DTFS needs to invalidate a disk buffer, if it exists, 
whenever a disk block is freed. This is because the 
disk blocks represented by that buffer might later be 
reallocated to a different sized extent. This could 
result in aliasing conflicts if the old disk buffer were 
to remain in the cache. Curiously enough, merely 
setting the B_INVAL or B_STALE flag in the buffer 
header does not prevent the fsflush kernel daemon 
from writing the buffer to disk, if the buffer had been 
previously delayed-written. Thus, besides an aliasing 
problem, valid data could be overwritten by stale 
data, so we had to write a routine to allow us to 
invalidate a single buffer. 


9. Future Work 

Our future plans for DTFS include adding new 
compression algorithms for different classes of data, 
improving performance, incorporating our HTFS 
technology to improve system throughput and speed 
up fsck time, and adding features that our custo- 
mers need. An example of the latter is the ability to 
recover files removed or truncated accidentally (com- 
monly known as ‘‘undelete’’). 


1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA 


To date, DTFS has been ported to several ver- 
sions of the Unix operating system, including SVR4, 
UnixWare, Solaris, and SCO Open Desktop. We 
intend to port it to other operating systems and 
hardware architectures. 


10. Conclusions 

With the disparity between CPU and I/O speeds, per- 
forming on-the-fly data compression and decompres- 
sion transparently in the file system is a viable option. 
As the performance gap widens, DTFS performance 
will approach that of other file systems. 

DTFS almost cuts the amount of disk space 
needed by conventional Unix file systems in half. In 
the introduction, we stated that DTFS was designed 
to be a commercial file system. The biggest sign of 
its commercial success is its planned inclusion into 
future versions of SCO’s Open Desktop product. 

The architecture of DTFS lends itself well to 
the inclusion of intermediate data processing between 
the file system independent layer and the device 
driver layer of the Unix operating system. In fact, it 
only took an engineer a few days to convert DTFS 
from compressing and decompressing data to 
encrypting and decrypting data. 
Abstract 


This paper describes the implementation of Sawmill, 
a network file system using the RAID-II storage sys- 
tem. Sawmill takes advantage of the direct data path 
in RAID-II between the disks and the network, which 
bypasses the file server CPU. The key ideas in the 
implementation of Sawmill are combining logging 
(LFS) with RAID to obtain fast small writes, using 
new log layout techniques to improve bandwidth, and 
pipelining through the controller memory to reduce 
latency. The file system can currently read data at 21 
MB/s and write data at 15 MB/s, close to the raw disk 
array bandwidth, while running on a relatively slow 
Sun-4. Performance measurements show that LFS 
improved performance of a stream of small writes by 
over an order of magnitude compared to writing 
directly to the RAID, and this improvement would be 
even larger with a faster CPU. Sawmill demonstrates 
that by using a storage system with a direct data path, 
a file system can provide data at bandwidths much 
higher than the file server itself could handle. How- 
ever, processor speed is still an important factor, espe- 
cially when handling many small requests in parallel. 


1. Introduction 


An I/O gap has arisen between the data demands of 
processors and the data rates individual disks can sup- 
ply [PGK88]. This gap is worsening as processor 
speeds continue to increase and as new applications 
such as multimedia and scientific visualization 
demand ever higher data rates. 


One common solution to the I/O bottleneck is the 
disk array, where several disks provide data in paral- 
lel. This can provide much higher performance than a 
single disk, while still providing relatively inexpensive 
storage. By using a RAID (Redundant Array of 
Inexpensive Disks), disk arrays can be made reliable 
despite disk failures. 


Even though disk arrays have the potential of 
supplying high-bandwidth data inexpensively, there 
are difficulties in making this data available to client 
machines across a network. For example, the RAID 
group at Berkeley built a prototype system called 
RAID-I, using a Sun-4/280 workstation connected to 
an array of 28 disks [CK91]. Unfortunately, the band- 
width available through the system was very low, only 
2.3 MB/s, mainly because the memory system of the 
Sun 4 file server was a bottleneck. 


To avoid the file server bottleneck, the Berkeley 
RAID group designed a storage system called RAID- 
II [DSH*94]. RAID-II uses hardware support to move 
data directly between the disks and the network at 
high rates, avoiding the bottleneck of moving data 
through the file server’s CPU and memory. 


Another problem with a RAID disk array is that 
small random writes are very expensive due to parity 
computation, which is used for reliability. One solu- 
tion is a log-structured file system (LFS) [Ros92], 
which writes everything to a sequential log so there 
are only large sequential writes. Thus, a log-structured 
file system can greatly improve performance of small 
writes. 


This paper describes the Sawmill file system, 
which has been designed to provide high bandwidths 
by taking advantage of the RAID-II architecture. 
Sawmill is a log-structured file system that is able to 
read data at 21 MB/s and write at 15 MB/s, close to 
the raw disk bandwidth. Sawmill is designed to oper- 
ate as a file server on a high-bandwidth network, pro- 
viding file service to network clients. 


Sawmill solves two problems: how to make use 
of a direct data path and how to combine a log-structured 
file system with a disk array. The key ideas in 
Sawmill are using streaming instead of caching, per- 
forming log layout on the fly, minimizing per-block 
overheads, and moving metadata over a separate con- 
trol path. 


The remainder of the paper is organized as fol- 
lows. Section 2 gives some background on RAID 
storage, the RAID-II storage system, and log-struc- 
tured file systems. Section 3 describes in more detail 
the implementation of the Sawmill file system and the 
techniques it uses to obtain high bandwidth. Section 4 
contains performance measurements of Sawmill. Sec- 
tion 5 discusses related work, Section 6 discusses 
future work, and Section 7 concludes the paper. 


2. Background 


There are three main concepts combined in the Saw- 
mill file system: the RAID disk array, hardware sup- 
port for a fast data path, and the log-structured file 
system (LFS). This section gives background infor- 
mation on these ideas. 


2.1 RAID 


A RAID (Redundant Array of Inexpensive Disks) 
combines two ideas: parallelism and redundancy 
[PGK88]. A RAID uses multiple disks in parallel to 
provide much higher bandwidth than a single disk. A 
RAID can also perform multiple operations in paral- 
lel. Redundancy in the form of parity is used to main- 
tain reliability. By storing parity, data can be fully 
recovered after a single disk failure. 


In a RAID, data is striped across multiple disks. 
We use a RAID-5 architecture, which stripes data as 
blocks. With N disks, a group of N-1 data blocks is 
striped across N-1 disks. A parity block is computed 
by exclusive-ORing these N-1 blocks and the parity 
block is stored on the remaining disk. Parity is distrib- 
uted; successive parity blocks are stored on different 
disks to avoid the bottleneck of a dedicated parity 
disk. Each individual block is called a stripe unit, and 
a collection of N-1 data blocks and a parity block is 
called a parity stripe. 


For peak efficiency, a full parity stripe should be 
written at once. In this case, the data and the parity 
can be written out in parallel. If only part of the parity 
stripe is modified, parity must be recomputed from the 
data already on disk. This overhead is especially 
costly for small writes. Figure 1 illustrates how parity 
is recomputed after a small write: the old data and 
parity must be read and then the new data and the 
recomputed parity are written, resulting in 4 disk 
operations. Thus, small writes are relatively expensive 
with a RAID, and writes of a parity stripe are 
most efficient. 


Figure 1. Read, modify, write. When a small part of the stripe is 
written to disk, the new parity is computed from the old data, the 
old parity, and the new data. Thus, four disk operations are 
required. 
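

The parity arithmetic behind the read-modify-write cycle can be sketched in a few lines. The block contents and function names below are made up for illustration; the sketch only demonstrates the XOR identities and does not model the RAID-II hardware.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise exclusive-OR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity after one block changes: old parity XOR old data XOR new data."""
    return xor_blocks(old_parity, old_data, new_data)


# Three data blocks of a stripe and their parity (arbitrary example values).
data = [b"\x0f\x10", b"\xf0\x01", b"\x33\x33"]
parity = xor_blocks(*data)

# Small write to the second block: read old data and old parity (2 reads),
# then write new data and new parity (2 writes).
new_block = b"\xaa\x55"
updated_parity = small_write_parity(data[1], parity, new_block)
data[1] = new_block
```

The updated parity equals the parity recomputed from the whole stripe, and XORing it with the surviving blocks reconstructs any single lost block.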


2.2 The RAID-II storage architecture 


The Berkeley RAID project found that a high-band- 
width disk array could easily saturate the memory 
bandwidth of a typical workstation file server [CK91]. 
The RAID-I prototype used a Sun-4/280 workstation 
connected to an array of 28 disks. The bandwidth 
available through the system was very low, only 2.3 
MB/s, mainly due to the low bandwidth of the Sun 4 
file server’s memory system and backplane. 


To avoid the file server bottleneck, the follow-on 
RAID-II project built a storage system with a new 
hardware architecture designed to support disk array 
bandwidths. This hardware provides a high-band- 
width data path between the disks and the network, 
bypassing the file server. This results in separation of 
the control and data paths: the file server handles 
requests and provides low-bandwidth control com- 
mands, while the controller board provides high- 
bandwidth data movement. The controller also has 
fast memory that can be used for buffering and 
prefetching. 


Figure 2 illustrates the hardware configuration of 
RAID-II. The controller provides a fast path between 
the disks and the network. It is built around a high- 
bandwidth crossbar switch that connects memory 
buffers to various functional units. These units 
include a fast HIPPI network interface, an XOR 
engine for computing parity, and the disk interfaces. 
Although RAID-II was designed to support 40 MB/s, 
the maximum bandwidth is currently about 24 MB/s. 
Performance is currently limited by the disk controller 
and the VME link to the controller; each of the four 
disk controllers can handle about 6 MB/s. We have 
used a HIPPI network and an Ultranet to connect to 
clients. Further details of RAID-II are given in 
[DSH*94]. 
Figure 2. The RAID-II storage architecture. The RAID-II storage system (on the right) provides high-bandwidth file access to the network clients 
(on the left). The RAID-II controller provides a fast data path between the disks and the network. The controller is built around a fast crossbar that 
connects the disks, memory, and the network. The file server, running the Sawmill file system, controls the storage system but stays out of the data 
path. The current configuration uses 16 disks on four controllers. These disks are treated as a single RAID stripe group. 


2.3 Log-structured file systems 


The third idea used by the Sawmill file system is the 
log-structured file system (LFS). A log-structured file 
system [Ros92] writes data only to a sequential log. 
(A file block does not have a fixed position on disk, 
but instead its position changes every time the block 
is rewritten.) The log is written to disk in large units, 
called segments, on the order of 1 MB in length. As 
the log fills the disk, old segments are read in and their 
live data is rewritten to the log, freeing the space held by 
stale data; this process is called cleaning. 


Because it writes only to the log, an LFS does not 
perform any random writes, but only large sequential 
log updates. This is an advantage with any disk sys- 
tem because it avoids the seek time and rotational 
latency of a random write. However, there is an even 
more significant advantage with a RAID due to the 
high parity overhead of small writes to RAID. By 
using an LFS, random writes can use the full sequen- 
tial write bandwidth of the disk array because they are 
grouped together into a log write. The LFS segment 
size must be a multiple of the parity stripe size for 
maximum performance. 
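

A rough operation count illustrates the benefit. The sketch below is a simplified cost model, not a measurement: it assumes each isolated small write costs the four disk operations described in Section 2.1, that a log write of a full parity stripe costs one write per disk, and that a partially filled final stripe is padded and written as a full stripe. The function names are hypothetical.

```python
import math


def read_modify_write_ops(blocks: int) -> int:
    """Disk operations when each block is written as an isolated small write."""
    return 4 * blocks


def full_stripe_ops(blocks: int, disks: int) -> int:
    """Disk operations when the same blocks are grouped into full parity stripes."""
    data_blocks_per_stripe = disks - 1          # one block per stripe holds parity
    stripes = math.ceil(blocks / data_blocks_per_stripe)
    return stripes * disks                      # one write per disk per stripe
```

For the 16-disk configuration, 150 random block writes cost 600 disk operations under this model when issued individually, but 160 when grouped into ten full stripes, and the grouped writes can also proceed in parallel across the disks.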


3. Implementation 


This section describes the implementation of the Saw- 
mill file system. Section 3.1 gives an overview of how 
Sawmill processes client requests. Section 3.2 
describes how Sawmill uses the controller memory. 
Section 3.3 explains how read performance is 
improved. Section 3.4 describes the new log layout 
techniques. Section 3.5 describes the handling of 
metadata. 


3.1 Handling a typical request 


Clients communicate with Sawmill over a high-speed 
network. In the current implementation, client appli- 
cations are linked with a small library to use the Saw- 
mill file system. This library converts file system 
operations into socket operations. For instance, to 
open a file, the client calls a library routine, which 
opens a socket connection to the Sawmill file server. 
The client then sends the open parameters through the 
socket and receives the status from the file server. It 
would also be straightforward to provide access to 
data via NFS; however, performance would be limited 
by NFS protocol overhead. 


For a read, the client library routine sends the 
read parameters over the socket and then receives the 
data. On the server side, the read request is received 
into controller memory and copied through the RAID- 
II controller into the file server. The Sawmill file sys- 
tem determines where the data is stored on disk. The 
file server sends commands to the RAID-II controller 
to read data from disk into controller memory. When 
the first block of data has arrived from the disks, the 
file server issues commands to send the data across 
the network and to read the next blocks of data. Pipe- 
lining the transfers in this way reduces latency. Note 
that the file server only processes requests and han- 
dles control messages, while the RAID-II controller 
does the actual data movement. 


Writes to Sawmill are handled in a similar fashion. 
The client sends the write parameters over the 
socket, followed by the data. The file server receives 
the write request, determines where in controller 
memory to receive the data, and informs the RAID-II 
controller to start receiving data. After a full segment 
of the log has been collected in controller memory, the 
file server commands the controller to write the data 
to disk. Again, the data is transferred from network to 
disk without passing through the file server’s memory. 


Sawmill currently uses the same network to 
receive requests and to transfer data. However, it 
would be straightforward to use separate control and 
data networks. This would typically be used if the 
high-speed network had high latency, for instance, to 
set up a connection. 


3.2 Controller memory 


The RAID-II controller board has a 32 MB memory 
buffer. This memory is used by the RAID striping 
driver and the Sawmill file system. The RAID striping 
driver uses memory for buffering, to compute parity, 
and to reconstruct data after a disk failure. The Saw- 
mill file system uses the remaining memory for sev- 
eral purposes: buffering network requests and replies, 
holding LFS segments before they are written to disk, 
buffering data during reads, and holding LFS seg- 
ments for cleaning. This section describes the use of 
controller memory for reads. Section 3.4 will describe 
the use of controller memory for layout of write data 
in the log. 


In the original Sawmill file system design, we 
planned to implement a standard file system block 
cache in controller memory. However, we decided 
against this for two reasons. First, a block cache 
requires a significant amount of CPU overhead for 
each block, to check whether a block is in the cache 
and to update the cache information with new blocks. 
Second, with RAID-II, the memory available for a 
cache would be about 20 MB; we expect this is too 
small to provide a high hit rate with high-bandwidth 
applications. Note that with a data rate of about 20 
MB/s, the lifetime of cached data would be only one 
second. In any case, client caching could provide 
most of the potential benefits of a server data cache. 
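
The one-second figure follows directly from dividing the cache size by the data rate. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def cache_lifetime_seconds(cache_mb, data_rate_mb_per_s):
    """Time for a stream at the given rate to replace the whole cache."""
    return cache_mb / data_rate_mb_per_s


# About 20 MB of cache and about 20 MB/s of traffic, as stated in the text.
print(cache_lifetime_seconds(20, 20))
```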


Currently, the Sawmill file system uses controller 
memory for pipelining read requests. That is, for large 
reads, disk operations are overlapped with network 
operations to hide the latency of disk and network 
operations. As data blocks are read from disk to memory,
previous blocks are being sent from memory to
the network. We are currently adding prefetching to
Sawmill to obtain this benefit for small operations.


3.3 Batching of reads 


For efficiency, data should be read with a single disk 
operation whenever possible. Unfortunately, a 
sequential request might not be stored sequentially in 
the log. Fortunately, there will often be locality of 
writes, so the data may be stored in the same segment, 
even if the blocks are not stored sequentially. In this 
case, it would be more efficient to read the entire seg- 
ment and then send the pieces from memory over the 
network in the proper order, rather than perform mul- 
tiple disk operations to fetch the blocks in order. 


To group reads, the file system loops over each 
block, checking its position on disk and seeing 
whether it is part of the same segment. When it col- 
lects as many blocks as possible, it can read them all 
at once. More complex algorithms are possible to 
improve performance when reading out-of-order data. 
For instance, batching may be combined with 
prefetching to read blocks before they are requested. 
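
The grouping loop described above can be sketched as follows. This is an illustration under stated assumptions, not the Sawmill implementation: the function name is hypothetical, blocks are identified by their byte address on disk, and the 960 KB segment size is taken from the measurement setup in Section 4.

```python
SEGMENT_SIZE = 960 * 1024  # bytes; LFS segment size used in the measurements


def batch_by_segment(block_addrs, segment_size=SEGMENT_SIZE):
    """Group consecutive requested blocks that live in the same log segment.

    block_addrs: disk byte addresses of the requested blocks, in request order.
    Returns a list of (segment_number, [addresses]) batches; each batch can be
    served by one segment read, with blocks then sent in request order.
    """
    batches = []
    for addr in block_addrs:
        seg = addr // segment_size
        if batches and batches[-1][0] == seg:
            batches[-1][1].append(addr)      # same segment: extend current batch
        else:
            batches.append((seg, [addr]))    # new segment: start a new batch
    return batches
```

Blocks that are out of order within a segment still end up in one batch, which is the case the text identifies as worth a single segment read.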


3.4 On-the-fly layout 


A log-structured file system must arrange new file 
data into the sequential log that is written to disk; this 
process is called log layout. Sawmill uses a new 
method of performing log layout, called on-the-fly 
layout, to improve efficiency. In a standard LFS
implementation [Ros92][SBMS93], write data is 
stored in the file system block cache. A backend pro- 
cess pulls blocks out of the cache and places the 
blocks in the log. In contrast, Sawmill assigns a posi- 
tion in the log to each data block when the write 
request comes in. The file system informs the control- 
ler of this position, causing incoming data to go 
directly from the network into the proper position in 
the log. At the same time, the file system reserves 
space in the log for any needed metadata such as 
inodes. Figure 3 illustrates the two techniques. 
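
A minimal sketch of on-the-fly layout is shown below. The class, its method names, and the 4 KB block size are assumptions made for illustration; the paper does not give the data structures Sawmill uses. The point of the sketch is that a whole write is placed with one reservation, and that the returned offset tells the controller where incoming data should land.

```python
class SegmentLayout:
    """Assigns positions inside one in-memory log segment as writes arrive."""

    def __init__(self, segment_size=960 * 1024, block_size=4096):
        self.segment_size = segment_size
        self.block_size = block_size      # assumed block size
        self.next_offset = 0              # next free byte in the segment buffer
        self.extents = []                 # (file_id, first_block, n_blocks, offset)

    def place_write(self, file_id, first_block, n_blocks, metadata_bytes=0):
        """Reserve space for a whole write, plus any metadata, in one step.

        Returns the buffer offset for the incoming data, or None if the
        segment cannot hold the request (the caller would start a new segment).
        """
        needed = n_blocks * self.block_size + metadata_bytes
        if self.next_offset + needed > self.segment_size:
            return None
        start = self.next_offset
        self.extents.append((file_id, first_block, n_blocks, start))
        # Metadata space (for example, for inodes) is reserved after the data.
        self.next_offset = start + needed
        return start
```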


There are several advantages of doing layout 
when writes are received rather than as a backend pro- 
cess. First, layout for a large write can be done as a 
single operation, instead of many per-block opera- 
tions, thereby minimizing the processing overhead. 
Second, blocks do not have to be moved to and from 
the cache, but are immediately put in the proper location.
The RAID striping driver requires the segment
written to disk to be stored contiguously in memory.
Thus, with a block cache, a copy operation would be
required, which would significantly reduce performance.
(An alternative would be a device driver that accepts a
scatter-gather array so layout could be done by moving
pointers instead of blocks.) On-the-fly layout had a large
impact on performance of Sawmill; an earlier implementation
used a block cache for writes, but due to
copying and per-block overheads, performance was
limited to under two megabytes per second.


Figure 3. Comparison of log layout techniques. On the left,
the layout uses a cache as in traditional log-structured file
systems. On the right, the layout is done on the fly, directly
into a write buffer.


Because it doesn’t use a cache, on-the-fly layout 
has a few potential disadvantages. First, data cannot 
be reorganized before it is written. For instance, if 
writes to a file are received in random order into a 
cache, they could be taken out of the cache sequen- 
tially and written in order. However, if blocks are 
placed in the log immediately, they will be written in 
the random order in which they were received. This 
will decrease performance if they are later read 
sequentially. The second disadvantage occurs if data 
blocks are modified or deleted. With a cache, only the 
live blocks will be written to disk. However, with on- 
the-fly layout, the dead blocks will have a position in 
the log and will go to disk, resulting in unnecessary 
disk traffic. We expect that these disadvantages will 
not be important in practice. Because of the high data 
rates, blocks are likely to go to disk before they would 
be modified or deleted. Also, if there were a cache on 
the clients, the client cache could do the processing 
and provide the benefits. 


3.5 Metadata movement 


Besides handling file data, the file system must handle 
various types of metadata such as information on files 
(inodes), information about where blocks are on disk, 
the directory structure of the file system, and informa- 
tion about the contents of LFS segments. With a nor- 
mal storage system architecture, the file system just 
accesses metadata as it needs it. However, with 
RAID-II any required metadata must be explicitly 
copied between file server memory and the RAID-II 
controller. Since moving data over this path is relatively
slow, metadata is cached in file server memory.
As a result, the Sawmill file system must maintain 
consistency among metadata in file server memory, 
metadata in controller memory, and metadata stored 
on disk. 


4. Performance measurements


This section describes performance measurements of 
the Sawmill file system. Section 4.1 describes mea- 
surements of a single request stream with requests 
that are handled sequentially. Section 4.2 describes 
measurements of multiple requests processed in paral- 
lel: these numbers indicate the additional performance 
that a disk array can obtain by handling concurrent 
requests. Performance results are given for the raw 
disk array, the disk array configured as a RAID, and 
the disk array running Sawmill. 


These performance measurements used RAID-II 
with 16 disks. The stripe unit was 64 KB, yielding a 
parity stripe size of 960 KB. In other words, writes of 
960 KB transfer 64 KB data blocks to 15 disks and a 
64 KB parity block to 1 disk in parallel. The LFS seg- 
ment size was also 960 KB. The measurements in this 
section are of random requests of various fixed sizes. 
For these measurements, the Sawmill file server was a 
Sun-4/280, which is relatively slow (about 9 SPEC89 
integer SPECmarks). 
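
The stripe arithmetic in this configuration can be checked with a short sketch. The function names are hypothetical, and the full-stripe test assumes that a write starts on a parity stripe boundary.

```python
def parity_stripe_kb(n_disks=16, stripe_unit_kb=64):
    """Data held by one parity stripe: one stripe unit per disk, minus the parity disk."""
    return (n_disks - 1) * stripe_unit_kb


def is_full_stripe_write(size_kb, n_disks=16, stripe_unit_kb=64):
    """True if a stripe-aligned write covers only whole parity stripes."""
    return size_kb > 0 and size_kb % parity_stripe_kb(n_disks, stripe_unit_kb) == 0


print(parity_stripe_kb())            # 15 data units of 64 KB each
print(is_full_stripe_write(960))     # one LFS segment is exactly one parity stripe
```

Because the LFS segment size equals the parity stripe size, every segment written by Sawmill is a full-stripe write.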


Because we don’t have a client that can handle 
the data rates of RAID-II yet, the following perfor- 
mance tests did not transmit data across the network, 
but only transferred the data between the disks and 
controller memory. Our current client can only trans- 
fer 3 to 4 MB/s over the network. However, the mea- 
sured server CPU load to handle this network traffic 
was about 2%. Thus, the server could transfer the full 
RAID-II bandwidth over the network without a CPU 
bottleneck arising due to the network traffic. There- 
fore, the performance numbers in this section should 
be a reasonable indication of the actual network per- 
formance with a fast client. 


There are several key results from the perfor- 
mance measurements. First, Sawmill is able to pro- 
vide about 80% of the raw disk bandwidth for large 
requests. Sawmill is also able to handle a single 
stream of small read requests with performance close 
to that of the raw disk. Because of LFS, Sawmill can 
handle small writes an order of magnitude faster than 
the raw disk. However, the server CPU is a perfor- 
mance bottleneck for small writes and for multiple 
request streams. Small write performance with Saw- 
mill would improve dramatically with a faster CPU. 
Also, Sawmill can only support about 3 or 4 concurrent
streams of small requests before the CPU
becomes a bottleneck.


Figure 4. Read performance for a single stream of random requests
as a function of the request size. The top graph shows bandwidth
through the raw disk array, the disk array treated as a RAID, and the
Sawmill file system. The lower graph shows CPU utilization and
disk utilization through Sawmill.


4.1 Single request stream 


This section describes the performance of the raw 
disk array, the disk array treated as a RAID, and the 
Sawmill file system, when receiving a single stream 
of requests. This indicates the performance available 
to a single application. Figure 4 shows read requests 
and Figure 5 shows write requests. 


4.1.1 Raw disk array performance 


To understand the performance, the first item to exam- 
ine is the disk array. The array contains IBM 0661 
disks [IBM91]. These disks have average rotational 
latency of 7 ms and average seek time of 12 ms. Each 
disk can read and write at about 1.6 MB/s. The disk 
array was configured with 16 disks on four control- 
lers. A single SCSI string was able to read at 3.14 
MB/s with two disks. With two strings of two disks 
per controller, each controller can provide 6.24 MB/s. 
For the measurements in this section, the disk array
was treated as independent disks, with no parity computation.
To make the measurements comparable to
the RAID and Sawmill measurements, requests were
broken into 64 KB blocks and made to the appropriate
number of disks: requests smaller than 64 KB used a
single disk and larger requests used multiple disks.


Figure 5. Write performance for a single request stream. As in
Figure 4, the top graph shows performance of the raw disk array,
the RAID, and Sawmill. The lower graph shows CPU and disk
utilization with Sawmill. Note that because of LFS, Sawmill
performance is much better than writing directly to the disks. Also
note that small requests are totally CPU limited. This is due to LFS:
for small writes, almost all the time is spent grouping them
together. CPU utilization remains high, even for large writes.


The “Raw” lines in Figures 4 and 5 show the per- 
formance of the raw disk array as a function of 
request size. Performance was highly dependent on 
request size. Peak read performance was 24.5 MB/s 
and peak write performance was 18.4 MB/s. For small 
requests, seek time dominated. 


With the raw disk array, the processing overhead 
is minimal, so the performance is almost entirely 
dependent on the disk speed. The time to complete a 
request to a disk can be modeled approximately as: 


20 ms + disk_request_size / peak_bandwidth 


where the measured peak bandwidth is 1.57 MB/s for 
reads and 1.16 MB/s for writes. By dividing the 
request size by this time, the bandwidth for a given 
request size can be modeled. Writes are slower 
because of missed revolutions; the disks have a track
buffer that minimizes missed revolution costs for 
reads. 
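
The model above can be evaluated directly. In the sketch below the function names are hypothetical, and the conversion of 1 MB to 1024 KB is an assumption, since the text does not say which convention the measurements use.

```python
KB_PER_MB = 1024   # assumption: the text does not state 1000 or 1024 KB per MB

READ_PEAK = 1.57   # MB/s, measured peak bandwidth for reads (from the text)
WRITE_PEAK = 1.16  # MB/s, measured peak bandwidth for writes (from the text)


def request_time_ms(size_kb, peak_mb_per_s, overhead_ms=20.0):
    """Modeled time for one disk request: fixed overhead plus transfer time."""
    transfer_s = (size_kb / KB_PER_MB) / peak_mb_per_s
    return overhead_ms + transfer_s * 1000.0


def modeled_bandwidth_mb_per_s(size_kb, peak_mb_per_s, overhead_ms=20.0):
    """Request size divided by the modeled request time."""
    seconds = request_time_ms(size_kb, peak_mb_per_s, overhead_ms) / 1000.0
    return (size_kb / KB_PER_MB) / seconds


for size_kb in (4, 64, 512):
    print(size_kb,
          round(modeled_bandwidth_mb_per_s(size_kb, READ_PEAK), 2),
          round(modeled_bandwidth_mb_per_s(size_kb, WRITE_PEAK), 2))
```

Small requests are dominated by the fixed 20 ms term, so the modeled bandwidth only approaches the peak for requests much larger than the amount a disk can transfer in 20 ms.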


4.1.2 RAID performance 


In the previous section, the disk array was viewed as 
independent disks. In this section, the disks are treated 
as a RAID, but without a file system. The RAID strip- 
ing driver receives requests and breaks them up into 
stripe units spread across multiple disks. For writes, 
parity is computed. These measurements are signifi- 
cant because Sawmill is built on top of the RAID. 


The “RAID” lines in Figures 4 and 5 show the 
read and write performance of a single stream to the 
RAID device. Read performance is generally the 
same as for the raw disk. For large accesses there is a 
small performance loss, mainly from processing over- 
head and increased latency due to the RAID striping 
driver. RAID writes take at least twice as long as 
writes to the raw disk due to the read-modify-write 
parity update. Note the spikes in write performance at 
multiples of 960 KB; these sizes are multiples of the 
parity stripe size and thus don’t require a parity read- 
modify-write. 


These measurements show that for good write 
performance, the file system must perform operations 
in multiples of the parity stripe size. 
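
The difference between a full-stripe write and a read-modify-write update can be sketched with byte-wise XOR parity. The sketch assumes the usual XOR parity scheme; the function names are hypothetical and the code is not taken from the RAID striping driver.

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_stripe_parity(data_blocks):
    """Parity for a full-stripe write: computed from the new data alone."""
    parity = bytes(len(data_blocks[0]))  # zero-filled block
    for block in data_blocks:
        parity = xor_blocks(parity, block)
    return parity


def updated_parity(old_parity, old_data, new_data):
    """Parity for a partial write: old data and old parity must be read first."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)
```

A full-stripe write needs no disk reads, while a partial write must read the old data and old parity before writing, which is why partial writes take at least twice as long.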


4.1.3 Sawmill performance 


The Sawmill file system, by using the techniques 
described earlier, was able to obtain peak bandwidth 
of about 80% of the raw disk bandwidth. For the mea- 
surements in this section, random reads and writes of 
various sizes were made to a single 30 MB file. 


For a single request stream, Sawmill read performance
is close to the raw disk performance. There is a
performance loss for very large requests due to the 
per-request file system overhead of Sawmill and due 
to reads broken across multiple LFS segments. For 
large sequential reads, maximum performance is 21 
MB/s, slightly below the raw disk bandwidth. For 
individual small random reads, as shown in Figure 4, 
performance is limited by the file system overhead 
and the seek time of the disks, resulting in a minimum 
latency of about 20 ms per operation. 


Figure 4 shows the CPU and disk usage for Saw- 
mill read requests. The disk usage shows the percent- 
age of time the disks are in use, averaged across all 16 
disks (e.g. 8 disks in use 50% of the time is 25% utilization).
Because the measurements in Figure 4 use
only a single stream of requests, small requests only 
use one disk at a time and cannot exceed 6.25% utili- 
zation. Note that CPU load is fairly high in Figure 4, 
but drops around 100 KB. This shape is due to two 
factors: CPU overhead per byte and the total number 
of bytes. The CPU usage per disk operation is roughly 
constant. As the request size increases to the 64 KB 
stripe unit, each disk operation transfers more data, so 
the fraction of time spent with CPU overhead 
decreases. However, as request size continues to 
increase beyond 64 KB, more disks are used in paral- 
lel, causing the CPU usage to climb again. 
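
The disk utilization figure is an average over all 16 disks, which a short sketch makes concrete. The function name is hypothetical.

```python
def average_disk_utilization(busy_fractions):
    """Mean fraction of time the disks are busy, averaged over all disks."""
    return sum(busy_fractions) / len(busy_fractions)


# 8 of 16 disks busy half the time, as in the example in the text.
print(average_disk_utilization([0.5] * 8 + [0.0] * 8))

# A single stream of small requests keeps at most one disk busy at a time.
print(average_disk_utilization([1.0] + [0.0] * 15))
```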


Figure 5 shows write performance. Large files 
can be written to disk at about 15 MB/s. Note that 
small writes are totally CPU limited. This is due to 
LFS: since small writes are batched together and writ- 
ten sequentially, very many small writes take place 
before a disk operation occurs. Thus, the file system 
overhead to perform this batching dominates small 
write performance. As the request size increases, per- 
operation CPU time becomes less important, causing 
disk usage to climb and CPU usage to drop. These 
performance measurements omit the cost of writes 
that modify less than a file block; in this case an addi- 
tional read would be required to fetch the unmodified 
data. 


Figure 5 illustrates the benefits of a log-structured 
file system and also shows the performance lost due to 
CPU load. Because LFS groups small writes together, 
performance is about 20 times that of the RAID. This 
performance would be much better with a faster CPU; 
if there were no CPU load, the Sawmill line would be 
approximately flat around 15 MB/s, since the segment 
size written to disk is fixed. A more realistic projec- 
tion is to consider a CPU that is ten times faster, such 
as a DEC Alpha. Small write performance would 
scale almost linearly, since small writes are totally 
CPU bound. This would result in small write perfor- 
mance of 2 MB/s, about 200 times the performance of 
writing directly to the RAID. 


Figure 5 can be used to estimate the write perfor- 
mance of a file system based on the Unix FFS 
[MJLF84] running on RAID-II and compare it with 
Sawmill. If writes go to disk individually, the Unix 
file system would have performance similar to that of 
the RAID (assuming no CPU overhead for the Unix 
file system and ignoring the seeks to write metadata in 
the Unix file system), so Sawmill would be a factor of 
20 faster for small writes. A Unix file system with 
clustering [MK91] would shift performance along the 
RAID line, since the writes would be in larger units. 
However, Sawmill performance would still be much
better because Sawmill writes are grouped into full
parity stripes. With a faster processor, the benefits of
Sawmill would be even more dramatic because small
writes are currently CPU limited. Sawmill would have
additional overhead due to cleaning (which will be
discussed in Section 6.1), but the performance benefit
of logging is likely to greatly exceed this cost.


Figure 6. Read performance with multiple concurrent requests. The
lines indicate the raw disk array as 16 independent disks, the disk
array treated as a RAID, and the Sawmill file system. Concurrency
of RAID and Sawmill is limited by the CPU load.


4.2 Multiple stream performance 


This section describes the performance of the system 
when multiple requests are processed in parallel. This 
indicates the performance if, for example, multiple 
applications or multiple clients were using the system. 
Figure 6 shows read performance with concurrency 
and Figure 7 shows write performance. 


A disk array can process multiple requests in par- 
allel if they go to separate disks. This improves the 
overall system throughput and the total number of 
operations per second, but doesn’t improve the perfor- 
mance of any particular request stream. Concurrency 
is a significant benefit for small requests, which only 
use a single disk. Potentially, 16 small reads or 8 
small RAID writes could take place in parallel, 
improving performance by the same factor. Requests 
larger than the 64 KB stripe unit will be striped across
multiple disks and become inherently parallel. Thus, 
multiple requests improve the performance much 
more for small requests than large requests. 
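
The number of disks a single request keeps busy can be estimated as follows. This sketch assumes the request starts on a stripe unit boundary; the function name is hypothetical.

```python
import math


def disks_used(request_kb, n_disks=16, stripe_unit_kb=64):
    """Number of disks touched by one request that starts on a stripe unit boundary."""
    units = max(1, math.ceil(request_kb / stripe_unit_kb))
    return min(n_disks, units)


for size_kb in (4, 64, 256, 4096):
    print(size_kb, disks_used(size_kb))
```

A 4 KB request leaves the other 15 disks idle, so concurrent streams can use them; a multi-megabyte request already touches every disk.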


The limiting factor to concurrency in our system 
is the CPU load. Unfortunately, our file server CPU is 
relatively slow and cannot handle the full potential 
concurrency of the system. Only 3 or 4 Sawmill operations
can be run in parallel before the CPU becomes
a bottleneck. Since the disk array could support 16
operations in parallel, small operation throughput
would increase substantially with a better processor.


Figure 7. Write performance of the raw disk array, RAID, and Sawmill
with multiple concurrent requests. The spikes in RAID write
performance at 960 KB and 1.92 MB illustrate the write performance
possible when a full stripe is written at once; other writes
require an expensive parity update. Note that because of LFS, the
performance of Sawmill exceeds performance of the RAID.


In Figures 6 and 7, the degree of concurrency 
used depended on how much concurrency the system 
could handle before saturating. The raw disks handled 
16 independent request streams. The RAID handled 8 
independent streams for small requests, with the num- 
ber decreasing as the request size increased. Sawmill 
handled 4 concurrent small read requests. For Saw- 
mill, no write concurrency was required because LFS 
results in all the disks being used in parallel. 


For the raw disk array, the CPU load is almost 
entirely determined by the number of disk operations. 
Starting a disk operation, handling the interrupt, and 
concluding the disk operation take about 1 ms in total. 
Since the SCSI controller can only handle requests up 
to 128 KB, a larger logical operation results in multiple
disk operations. Due to these factors, CPU load
for 16 parallel raw disk operations is about 70% for 
small operations, drops to 20% for 128 KB requests, 
and then climbs slightly. This indicates that although 
the RAID-II architecture allows data rates greatly 
exceeding what the file server could move, our current 
file server CPU is just barely able to handle the high 
interrupt rates of small disk operations. 
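
A rough estimate of this CPU load can be obtained by combining the 1 ms per-operation cost with the request time model from Section 4.1.1. Combining the two in this way is an assumption made for illustration, as are the function name and the conversion of 1 MB to 1024 KB; the sketch covers only requests of up to 128 KB, which map to one disk operation each.

```python
def raw_cpu_load(request_kb, n_disks=16, cpu_ms_per_op=1.0,
                 overhead_ms=20.0, peak_mb_per_s=1.57, kb_per_mb=1024):
    """Estimated fraction of CPU time spent on disk operations with all disks busy."""
    if request_kb > 128:
        raise ValueError("larger requests are split into several disk operations")
    # Time for one disk operation, from the model in Section 4.1.1.
    op_time_ms = overhead_ms + (request_kb / kb_per_mb) / peak_mb_per_s * 1000.0
    # Each disk completes one operation per op_time_ms; each costs cpu_ms_per_op.
    return n_disks * cpu_ms_per_op / op_time_ms


print(round(raw_cpu_load(4), 2), round(raw_cpu_load(128), 2))
```

Under these assumptions the estimate is roughly 0.7 for 4 KB requests and roughly 0.16 for 128 KB requests, which is close to the measured trend.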


Besides bandwidth, another important measure- 
ment is the total number of I/O operations the disk 
array can support per second. Figure 8 shows the
number of I/O operations per second the system can
support, and compares a single request stream with
multiple request streams. These measurements are for
4 KB random operations.


Type | Read ops per sec | Write ops per sec


Figure 8. Random 4 KB I/O operations. This table shows the number
of I/O operations RAID-II can handle per second under various
configurations. CPU load limits multiple RAID and Sawmill
requests.


In theory, by performing multiple operations in 
parallel, performance should scale with the number of 
disks. However, the main limitation on parallelism 
with RAID-II is the CPU load; handling multiple 
requests through RAID or Sawmill saturates the CPU 
before all disks are in use. The RAID striping driver 
has more processing load than the raw disk, and the 
Sawmill file system has additional load. Thus, the 
number of parallel requests possible before the CPU 
saturates is reduced. The RAID measurements were 
saturated with 8 requests in parallel, while the Saw- 
mill measurements were saturated with 4 requests. 


For a single raw disk, the number of operations 
per second is almost entirely limited by the 7 ms rota- 
tional latency and the 12 ms average seek time. With 
16 raw disks, Figure 8 shows about a 5% performance 
penalty; the CPU load is about 75%, so there is some 
delay in starting requests. 


When the disk array is treated as a RAID, the I/O 
rates change significantly. A single stream of read 
requests to the RAID uses a single disk, so perfor- 
mance is approximately the same as for a single disk. 
(Performance in Figure 8 is slightly better for the 
RAID because seek distance was shorter for the 
striped requests.) Performance of a single small write 
stream is about half of the read performance due to 
the read-modify-write parity update. With multiple 
parallel request streams, the RAID uses disks in paral- 
lel. Performance levels off around 8 request streams 
because the CPU becomes saturated. A faster CPU 
would significantly increase the total number of I/Os 
per second that the RAID can support. 


Figure 8 also shows performance of the Sawmill 
file system for small requests. Comparing the read 
measurements to the RAID with a single request 
stream shows that there is a slight performance loss
due to Sawmill; this is due to overhead in the file sys- 
tem. (The read latency is simply the reciprocal of the 
I/Os per second, or about 22 ms.) Due to the CPU 
load, reads obtained little benefit from concurrency. 
For writes, the number of I/Os per second is very 
good compared to the RAID, showing the benefit of 
LFS. With a faster CPU, the number of I/Os per sec- 
ond would be even higher, since performance is CPU 
limited, not disk limited. 
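
For a single request stream, latency and operations per second are reciprocals of each other, as noted above. A minimal sketch of the conversion, with hypothetical function names:

```python
def ops_per_sec_from_latency_ms(latency_ms):
    """Operations per second for one stream that issues requests back to back."""
    return 1000.0 / latency_ms


def latency_ms_from_ops_per_sec(ops_per_sec):
    """Per-request latency for one stream, given its operation rate."""
    return 1000.0 / ops_per_sec


# A read latency of about 22 ms, as stated in the text.
print(round(ops_per_sec_from_latency_ms(22), 1))
```

The relation does not hold for multiple concurrent streams, where several requests are in progress at once.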


4.3 Scalability 


This section describes how the performance measure- 
ments are likely to change with improvements in tech- 
nology. To summarize, performance of large requests, 
small writes, and concurrent operations is likely to
improve with faster technology. However, small reads 
are limited by the performance of individual disk 
drives, which will not improve as quickly over time. 


Because small writes are limited by CPU speed, 
small write performance will scale almost linearly 
with increases in processing power. Current high-per- 
formance workstations already have ten times the 
CPU power of our Sun-4 server, and even faster 
machines will become available over the next few 
years. Thus, the performance of small writes in a 
Sawmill-like file system should improve dramatically 
over what we measured. 


Small individual random reads are limited by 
disk positioning time. Disk positioning times are 
likely to improve relatively slowly, compared to 
improvements in CPU speed. Faster CPU speeds will 
improve the potential parallelism of small reads, 
which was limited by our CPU. Increasing the number 
of disks will increase the total number of independent 
reads that can be carried out simultaneously. This will 
increase the number of I/O operations per second, but 
won’t help the performance of an individual request. 
Prefetching can be used to improve the performance 
of individual requests; if the data is fetched before it is 
required, the disk latency will not affect latency to 
handle the request. One technique for this is Gibson’s 
Transparent Informed Prefetching [GPS93]; by 
obtaining hints from the application, the file system 
can fetch data before it is required. 


Large reads and writes already achieve close to 
the raw system performance with Sawmill. Thus, 
large request performance will only increase with 
storage systems that have more or faster disks. The 
CPU speed will have to increase proportionally or 
else the CPU will become a bottleneck. However, 
note that the current server is a Sun-4 and much faster
CPUs are readily available. 


5. Related work 


There are several storage systems that provide high- 
performance I/O. One fundamental problem these 
systems must solve is how to provide the data without 
being limited by the file server’s CPU, memory, and 
I/O bandwidth. There are various methods that have 
been used to solve these problems. For instance, mass 
storage systems usually use an expensive mainframe, 
designed to support high bandwidth, as the file server. 
Other systems use multiprocessors or multiple servers 
to provide sufficient server power. 


5.1 Mass storage systems 


There are several high-performance mass storage sys- 
tems in use at supercomputing centers. One example 
is the LINCS storage system at Lawrence Livermore 
National Laboratory [HCF+90]; this storage system
has 200 GB of disk connected to an Amdahl 5868. 
Another example is a system at the Los Alamos 
National Laboratory [CMS90], which has an IBM 
3090 running the Los Alamos Common File System. 


These systems solve the problem of the file server 
being a performance bottleneck by using a mainframe 
as the file server. The mainframe server is designed to 
have the I/O bandwidth, memory bandwidth, and 
CPU power necessary to provide very high data rates 
to clients. In addition, the server is connected to a net- 
work sufficient to handle very high data rates. The 
disadvantage of these mass storage systems is their 
cost. Because they are custom-designed and use an 
expensive mainframe, they are used at very few sites. 


5.2 Multiprocessor file servers 


Another approach to increasing I/O performance is to 
use a multiprocessor as the file server. Such a system 
avoids the problem of the server being a performance 
bottleneck by using multiple processors, with the 
associated gain in server CPU power and memory 
bandwidth. One example of this is the Auspex NFS 
server [Nel90]. The Auspex system uses asymmetric 
functional multiprocessing, in which separate proces- 
sors deal with the Ethernet, files, disk, and manage- 
ment. The necessary disk bandwidth is provided by 
parallel SCSI disks. However, the performance of the 
Auspex is limited by its use of NFS, Ethernet, and a 
single 55 MB/s VME bus; measurements show it can 
supply about 400 KB/s to a single client and can saturate
an Ethernet network with 1 MB/s per Ethernet
connection [Wil90].


The DataMesh project proposed a different 
approach to multiprocessing [Wil91]. The proposed
system would consist of a large array of disk nodes, 
where each node had a fast CPU (20 MIPS) and 8 to 
32 MB of memory. These nodes would be connected 
to a high-performance interconnection network. 


By providing multiple processors, these systems 
avoid bottlenecks from limited server CPU band- 
widths. However, this tends to be an expensive solu- 
tion, since it requires buying enough processors to 
provide the necessary memory bandwidth. Even so, 
the Auspex server still has a memory bandwidth bot- 
tleneck since it uses a single VME bus. 


5.3 Striping across servers 


High bandwidth access to very large data objects can 
also be provided by striping files across multiple serv- 
ers, so that overall bandwidth isn’t limited by the 
memory system of a single server. An example is the 
Swift system [CL91]. In this system, data is striped 
across multiple file servers and networks to provide 
more bandwidth than a single server could provide. 
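
The general idea of striping a data object across servers can be sketched as below. The round-robin placement, the function name, and the unit size are assumptions made for illustration; they are not taken from the Swift design.

```python
def stripe(data, n_servers, unit):
    """Split data into fixed-size units and deal them out to servers in turn."""
    per_server = [[] for _ in range(n_servers)]
    for offset in range(0, len(data), unit):
        server = (offset // unit) % n_servers
        per_server[server].append(data[offset:offset + unit])
    return per_server


# Eight bytes striped over three hypothetical servers in 2-byte units.
print(stripe(b"abcdefgh", 3, 2))
```

Each server holds only a fraction of the object, so a large transfer can draw on the memory and network bandwidth of all servers at once.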


A second system that stripes data across multiple 
servers is Zebra [HO93]. In Zebra, each client writes 
its data to a sequential log. This log is then striped 
across multiple servers, each with a disk. Zebra and 
Sawmill both combine LFS and RAID. The key dif- 
ference is that Zebra uses multiple servers with single 
disks and Sawmill uses a single server with multiple 
disks. Swift and Zebra avoid the file server bottleneck 
by using multiple servers, while Sawmill avoids the 
bottleneck by using a data path in hardware. There is a 
cost trade-off: Sawmill requires a special controller, 
while Swift and Zebra require multiple fast servers. 


5.4 RAID parity updates 


There are several techniques to reduce the cost of 
updating parity after a partial stripe write. One tech- 
nique is Parity Logging [SGH93]. In this technique, 
parity updates are written to a log. At regular inter- 
vals, the log is scanned and the parity modifications 
are applied to the standard RAID parity blocks. Float- 
ing Parity [MRK93] is a second technique for mini- 
mizing parity cost. In this technique, multiple parity 
blocks are reserved on disk. Updates can use the block 
closest to the current rotational position of the disk. 
The Logical Disk [dJKH93] implements a log-struc- 
tured file system at the disk level rather than the file 
system level by writing all blocks sequentially to a log 
and maintaining the mapping between logical disk
blocks and physical disk blocks. Since it writes blocks 
to a sequential log, it avoids random parity updates. 
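
The mapping maintained by a log-structured disk layer can be sketched in a few lines. The class below is a hypothetical illustration of the idea only, not the Logical Disk interface: every write is appended to the log, and a map records where the latest copy of each logical block lives.

```python
class LogStructuredDisk:
    """Toy model: sequential log plus a logical-to-physical block map."""

    def __init__(self):
        self.log = []        # physical blocks, in the order they were written
        self.block_map = {}  # logical block number -> index into self.log

    def write(self, logical_block, data):
        """Append the block to the log and point the map at the new copy."""
        self.block_map[logical_block] = len(self.log)
        self.log.append(data)

    def read(self, logical_block):
        """Return the most recently written copy of the block."""
        return self.log[self.block_map[logical_block]]
```

Overwriting a block leaves its old copy in the log as dead space, which is why such a design also needs cleaning.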


6. Future Work 


Probably the most important unanswered issue with 
Sawmill is the cost of cleaning a log-structured file 
system. Another major issue is how to improve the 
performance of small reads; this was discussed earlier. 
A related issue is improving large read performance 
by data reorganization. 


6.1 Cleaning cost 


One of the key unanswered questions about Sawmill 
is the overhead due to log cleaning. Cleaning is the 
process of garbage-collecting the log to free up space. 
Cleaning is not yet operational in Sawmill, so perfor- 
mance measurements are not available. Previous work 
[Ros92] indicated that overall cleaning costs would be 
low. However, [SBMS93] found cleaning costs to be 
high for some workloads, such as transaction process- 
ing. Costs were high particularly in environments 
with largely full disks, and cleaning could potentially 
cause service interruption. The former work did not 
have dynamic performance measurements of the 
cleaner, and the latter measurements were on a largely 
untuned implementation. 


The impact of cleaning on the performance of 
log-structured file systems continues to be a topic 
requiring additional investigation, which is beyond 
the scope of this paper. At this point, there is no evi- 
dence to suggest that Sawmill will not perform well 
for the workloads typical of office environments. Fur- 
thermore, even if cleaning costs are high, Sawmill’s 
ability to avoid the “small-write problem” of RAID 
devices could well offset such overheads. 


6.2 Data reorganization 


Log-structured file systems are write-optimized; that 
is, they organize data for fast writes, even though this 
may disadvantage later reads. In particular, after ran- 
dom writes data may not be stored sequentially on 
disk, resulting in lower performance for later sequen- 
tial reads. However, by reorganizing data on disk so 
the data is sequential, the reorganized data can be 
more efficient for reads. Thus, reorganization on Saw- 
mill could improve sequential read performance. This 
reorganization could either take place during idle 
times in the system, or it could be integrated with 
cleaning so cleaned data is written back in a better 
pattern for reads. 


7. Conclusions


This paper has described Sawmill, a high bandwidth 
file system that uses the RAID-II disk array. Sawmill 
uses a cost-effective file server to provide high-band- 
width access to a disk array. By taking advantage of 
hardware support, Sawmill provides data rates much 
higher than the memory bandwidth of the file server. 


Measurements show that for large requests Saw- 
mill operates at close to the raw bandwidth of the disk 
array, reading data at a peak rate of 21 MB/s and writ- 
ing data at 15 MB/s, while running on a Sun-4. The 
log-structured file system improved performance of a 
small write stream by an order of magnitude over the 
RAID, and with a faster processor small write perfor- 
mance would be improved even more. The high CPU 
load in many of the performance measurements 
shows that even with hardware support, there are still 
significant demands on the file system CPU, and a fast 
processor is still required. Our current CPU was a bot- 
tleneck for small writes and concurrent operations. 


In conclusion, Sawmill shows that combining a 
direct data path, a disk array, and a log-structured file 
system is a cost-effective method of providing high- 
bandwidth storage. 


Abstract


This paper describes a new version of the Network File 
System (NFS) that supports access to files larger than 
4GB and increases sequential write throughput seven- 
fold when compared to unaccelerated NFS Version 2. 
NFS Version 3 maintains the stateless server design 
and simple crash recovery of NFS Version 2, and the 
philosophy of building a distributed file service from 
cooperating protocols. We describe the protocol and 
its implementation, and provide initial performance 
measurements. We then describe the implementation 
effort. Finally, we contrast this work with other dis- 
tributed file systems and discuss future revisions of 
NFS. 


1. Introduction 
“It is common sense to take a method and try it.
If it fails, admit it frankly and try another. But 
above all, try something.” Roosevelt, 1932 


The NFS protocol is a collection of remote procedures 
that allow a client to transparently access files stored 
on a server [Joy84a]. It is independent of architecture
[RFC1014], operating system, network, and transport 
protocol. The protocol does not exactly match the se- 
mantics of any existing system. Instead, it provides a 
basis for portability and interoperability. 


NFS Version 1 existed only within Sun Microsystems
and was never released. NFS Version 2 was implemented
in 1984 and released with SunOS 2.0 in 1985
[Sandberg85]. NFS Version 2 implementations exist for
a variety of machines, from personal computers to
supercomputers.


2. NFS Version 2 protocol problems 


Several problems in NFS Version 2 could only be 
solved through a new version of the protocol. The 4GB 
file size limitation has recently become a pressing 
problem, although implementations of NFS on larger 
machines such as Cray supercomputers exposed this 
limitation years ago. 


Performance suffers under NFS Version 2 because
the protocol requires servers to write data and
file system metadata to stable storage (usually disk)
synchronously, before replying successfully to a client
WRITE request [Ousterhout90]. The performance
problem with synchronous writes was recognized early.
NFS Version 2 has an artifact of a proposed interface
for asynchronous writes (the undefined
WRITECACHE procedure).


Implementations have attacked this problem in 
several ways. [Moran90] describes the Prestoserve 
product, which interposes a software driver between 
the file system and disk driver to accelerate writes by 
using nonvolatile RAM. [Juszczak94] describes a
technique called write gathering, which exploits the 
tendency of more-capable clients to send write re- 
quests in clusters to gain parallelism. The author im- 
plemented a server that gathers several writes before 
synchronously committing the data to disk, thereby 
amortizing the cost of synchronous writes over several 
requests. [Hitz94] describes an integrated file server 
design that combines a log-based file system and non- 
volatile RAM to solve the synchronous write bottle- 
neck. 


Some implementations provide an “unsafe” op- 
tion in their NFS Version 2 server that disables com- 
mitting modified data to stable storage. While improv- 
ing performance, this option violates the stable storage 
guarantee in the NFS Version 2 protocol and can result 
in data loss. This option has resulted in heated debate. 


Lack of consistency guarantees was cited as the 
cause of excessive requests over-the-wire resulting in 
increased server loading and response time 
[Howard88]. [Reid90] and [Arnold91] describe additional
problems with NFS Version 2.


3. The NFS Version 3 protocol 


Engineers from several companies gathered for a
two-week series of meetings in July 1992 in Boston, MA,
to develop an NFS Version 3 specification. The
group’s goal was to address compelling issues in the
current protocol that could not be solved by
implementation practice. The only absolute requirement
was 64-bit file size support.


Other issues under consideration included the
following:


* Solving the write throughput bottleneck

* Minimizing the work needed to create an NFS
  Version 3 implementation given an existing NFS
  Version 2 implementation

* Ensuring that implementation of the new protocol
  is feasible on less-capable client operating
  systems (for example, DOS)

* Completely documenting the resulting protocol
  and annotating it with implementation examples
  to aid developers

* Deferring new features to subsequent revisions of
  NFS due to time constraints


Above all, the driving principles were the following:

* Keep it simple

* Get it done in a year

* Avoid anything controversial


Although it wasn’t an absolute requirement, we 
felt that solving the write throughput bottleneck would 
provide the most compelling feature. 


3.1. Changes introduced 


NFS Version 3 represents an evolution of the existing 
NFS Version 2 protocol. Most of the original design 
features described in [Joy84a], [Sandberg85], and
[RFC1094] persist. This revision introduces the fol- 
lowing major changes: 


* Sizes and offsets are widened from 32 bits to 64
  bits.

* The WRITE and COMMIT procedures allow reliable
  asynchronous writes.

* A new ACCESS procedure fixes known problems
  with super-user permission mapping and allows
  servers to return file access permission errors to
  the client at file open time to provide better
  support for systems with Access Control Lists
  (ACLs).

* All operations now return attributes to reduce the
  number of subsequent GETATTR procedure calls.

* The 8KB data size limitation on the READ and
  WRITE procedures is relaxed.

* A new READDIRPLUS procedure returns both file
  handle and attributes to eliminate LOOKUP calls
  when scanning a directory.

* File handles are of variable length, up to 64 bytes,
  as needed by some implementations
  [Pawlowski89]. (We kept the file handle size
  small enough to allow efficient DOS
  implementations.)

* Exclusive CREATE requests are supported.

* File names and path names are now specified as
  strings of variable length, with the maximum
  length negotiated between the client and server
  (with the PATHCONF [POSIX90] procedure).

* The errors the server can return are enumerated in
  the specification; no others are allowed.

* The notion of blocks is discarded in favor of
  bytes.

* The new NFS3ERR_JUKEBOX error informs clients
  that a file is currently off-line and that they
  should try again later.


Appendix 1 provides a summary of the protocol differ- 
ences between NFS Version 2 and NFS Version 3. Re- 
fer to [NFS3] for more details. 


At least eight new versions of NFS have been pro- 
posed to fix NFS Version 2, none of which has ever 
been completely implemented. Public reviews of the 
draft versions of new protocol specifications have oc- 
curred continuously since early 1987. Several changes 
included in NFS Version 3 first appeared in those eight 
drafts. 


3.2. What was avoided 
“Let joy and innocence prevail.” Toys, 1993 


In the years since the NFS protocol was first described, 
implementation practice solved several problems orig- 
inally thought to require a protocol revision, although 
minor, undocumented changes were made to the pro- 
tocol without a formal revision. In practice, NFS Ver- 
sion 2 mostly works, and we tried not to break it. Ac- 
cepting common implementation practice reduced the 
number of changes needed to produce NFS Version 3. 
Minor protocol changes were cleaned up and incorpo- 
rated into this work. 


We decided to maintain the current stateless de- 
sign of NFS and not include strict cache consistency. 
When we defined NFS Version 3, research work on 
consistent versions of NFS was incomplete. Delaying 
support for 64-bit file sizes to explore adding stateful 
consistency was unacceptable. In addition, it seemed 
clear that supporting strict data consistency introduces 
complexities that would preclude implementation on 
less-capable clients. Finally, the recovery benefits of a 
stateless server were clear, while the issues of stateful
recovery were not. 


The stateless server design of NFS creates a problem
with the replaying of nonidempotent requests. An
idempotent request such as LOOKUP can be successfully
executed any number of times. A nonidempotent
request such as REMOVE can be successfully executed
only once. Primarily a correctness problem, this
condition has been solved through the use of a reply cache
of recently serviced requests on the server
[Juszczak89]. Proposed protocol extensions to NFS
attempted to fix this but were essentially misguided. The
Boston group simply acknowledged the effectiveness
of this implementation technique and left the protocol
alone.
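

The reply cache idea can be illustrated with a toy server. This is a minimal sketch under stated assumptions, not the implementation described in [Juszczak89]: the class, the (client, transaction id) cache key, and the unbounded dictionary are all hypothetical simplifications.

```python
class ToyServer:
    """Toy file server with a reply cache for the nonidempotent REMOVE."""

    def __init__(self, names):
        self.files = set(names)
        self.reply_cache = {}  # (client, xid) -> reply already sent

    def remove(self, client, xid, name):
        key = (client, xid)
        if key in self.reply_cache:
            # Retransmission of a request already serviced: replay the reply
            # instead of executing REMOVE a second time.
            return self.reply_cache[key]
        if name in self.files:
            self.files.remove(name)
            reply = "NFS_OK"
        else:
            reply = "NFSERR_NOENT"
        self.reply_cache[key] = reply
        return reply


server = ToyServer(["report.txt"])
first = server.remove("client-a", 7, "report.txt")
# The reply is lost on the network, so the client retransmits with the same xid.
retry = server.remove("client-a", 7, "report.txt")
```

Without the cache, the retransmitted REMOVE would find the file already gone and report an error for an operation that in fact succeeded.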

Many other changes to NFS Version 2 were proposed
in the eight protocol revisions, including the
following:


* The ZERO procedure to punch holes in a file
* Append mode writes
* Record-oriented I/O support
* File name to include versions
* User and group fields as strings
* Extended attributes (arbitrary key/value pairs)
* Well-defined UID mapping procedures
* Advisory close procedure
* Resource fork support for the Macintosh
* Multiple OS-dependent name spaces
* A get server statistics procedure


Most of the above proposed features were rejected 
because by 1992 implementers had worked around 
purported “protocol limitations” that would prevent 
implementations on non-UNIX platforms. Other pro- 
posed features above were rejected because they were 
specific to a single operating system. The remaining 
proposed features were discarded because they at- 
tempted to solve a problem simplistically that was best 
solved correctly (for example, append mode writes 
versus a full consistency protocol). 


4. Design and implementation 


NFS Version 3 defines a revision to NFS Version 2; it 
does not provide a new model for distributed file sys- 
tems. Because of this, NFS Version 3 resembles NFS 
Version 2 in design assumptions, file system and con- 
sistency model, and method of recovering from server 
crashes. For a general description of the implementation
issues of NFS, see [Sandberg85], [Israel89],
[Juszczak89], [Pawlowski89], [Macklem91], and
[Juszczak94].


4.1. NFS design 


NFS achieves architecture and operating system inde- 
pendence through a strict separation of the protocol 
and its implementation. The protocol is the interface 
by which clients access files on a server. A client or 
server implements the protocol by mapping local file 
system actions into the file system model defined by
NFS. The NFS protocol does not dictate how a server
implements the interface or how a client should use the
interface [Satyanarayanan89]. For example, the NFS
Version 3 protocol does not define how a client should
manage cached data, but it does provide information to
improve cache management.


Although implementations have been used to illustrate
aspects of the NFS protocol, the specification
itself is the final description of how clients should
access servers. Semantic details that were not fully
described in the NFS Version 2 specification [RFC1094]
have proven, in practice, not to be a problem and have
been worked out through interoperability testing. Most
problems are flaws in implementations rather than in
the protocol design.


The NFS protocol is stateless; that is, each request
contains sufficient information to be completely
processed without regard to other requests. The server
does not need to maintain state about any previous
requests¹ other than file data on stable storage, and a
map of file handles (opaque tokens used by clients to
access files) to files derived from file system data. Of
course, most servers cache file data that has been
synchronized to disk to improve performance. However,
this cached data is not needed for correct operation.


Server crash recovery is simple. A client need 
only retry a request until the server responds; the client 
does not know that the server has rebooted (although 
the user may notice delayed responses). Experience at 
Sun with network disk (nd), an earlier method of sharing
disk storage on a network, led to the stateless server
requirement in the initial design of NFS [Joy84b].


The NFS Version 3 protocol requires that modi- 
fied data on the server be flushed to stable storage be- 
fore replying. Only asynchronous writes are excepted. 
NFS clients block on close(2) until all data is flushed 
to stable storage on the server, to return any errors to 
the application that might occur during delayed writes 
(for example, out of space). 


NFS clients are decidedly not stateless. NFS clients
hold modified data that has not been flushed to the
server, and they cache file handles and attributes.
Clients typically use attribute information, such as file
modification time, to validate cached information.
When a client crashes, no recovery is necessary for
either the client or the server.


¹ To be precise, the reply cache on a server contains volatile state
needed for correctness [Kazar94]. See [Bhide91] for further
discussion on the reply cache and its implications for server correctness.
TCP-based implementations of NFS still need a reply cache to
prevent destructive replay following connection re-establishment.





Thus, NFS servers are stupid and NFS clients are
smart. NFS Version 3 opens the way for smarter
clients.


4.2. Multiple version support 


The Remote Procedure Call (RPC) protocol provides 
explicit support for multiple versions of a service 
[RFC1057]. The client and server implementations of 
NFS Version 3 provide backward compatibility with 
NFS Version 2 by supporting both NFS Version 2 and 
NFS Version 3. By default, an RPC client and server 
bind using the highest version number they both sup- 
port. Client or server implementations that cannot sup- 
port both versions (for example, due to memory re- 
strictions) should support NFS Version 2. 
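

The outcome of this binding rule can be sketched in a few lines. The function below is a hypothetical illustration of the result only; it does not model the RPC messages that a client and server actually exchange to discover each other's versions.

```python
def bind_version(client_versions, server_versions):
    """Return the highest NFS version both sides support, or None if there is none."""
    common = set(client_versions) & set(server_versions)
    return max(common) if common else None


# A dual-version client talking to a Version 2-only server falls back to Version 2.
chosen = bind_version({2, 3}, {2})
```

An implementation that supports only Version 2 can therefore still interoperate with any implementation that follows the recommendation above.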


4.3. Implementation issues 


A primary goal in restricting the changes between NFS 
Version 2 and NFS Version 3 was to minimize new 
implementation issues. Implementation issues exist in 
the following areas: 


* 64-bit file sizes and offsets

* Asynchronous writes

* READDIRPLUS (read directory with attributes)

* NFS3ERR_JUKEBOX

* Weak cache consistency

* Other issues


4.3.1 64-bit file sizes and offsets 


The 64-bit extensions in NFS Version 3 introduce 
problems with mismatched clients and servers, such as 
a 32-bit client and a 64-bit server, or a 64-bit client and 
a 32-bit server. 


A 64-bit client will never encounter a file that it 
cannot handle when using a 32-bit server. If it sends a 
request that the server cannot handle, the server should 
return NFS3ERR_FBIG. 


The problems posed by a 32-bit client and a 64-bit 
server are more difficult. The server can handle any- 
thing that the client can generate. However, the client 
cannot handle a file whose size cannot be expressed in
32 bits, and will not properly decode the size of the file
into its local attributes structure. One solution is for the 
client to deny access to any file whose size cannot be 
expressed in 32 bits. This introduces anomalous be- 
havior when a file is extended by the client beyond its 
limit, thus rendering the file inaccessible. 


Another solution is for the client to map any size 
greater than it can handle to the maximum size that it 
can handle, effectively “lying” to the application pro- 
gram. This allows the application access to as much of 
the file as possible given the 32-bit offset restriction. 


Although this solution eliminates the anomalous be- 
havior described in the first solution, it introduces the 
problem that a client might be able to access only part 
of a file. However, other solutions exist. 
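

The two client policies described above can be compared side by side. This is a minimal sketch, assuming an unsigned 32-bit size field on the client (consistent with the 4GB limit); the function name, the policy names, and the exception type are hypothetical.

```python
MAX_32BIT_SIZE = 2**32 - 1  # largest size an unsigned 32-bit field can hold


class FileTooLarge(Exception):
    """Raised by the 'deny' policy (hypothetical error type)."""


def local_size(server_size, policy="clamp"):
    """Map a 64-bit size reported by the server into a 32-bit client's view."""
    if server_size <= MAX_32BIT_SIZE:
        return server_size
    if policy == "deny":
        # First solution: refuse access to files the client cannot describe.
        raise FileTooLarge(server_size)
    # Second solution: "lie" to the application and report the maximum.
    return MAX_32BIT_SIZE


clamped = local_size(2**40)
```

With the clamp policy the application can still reach the first 4GB of the file; with the deny policy a file that grows past the limit becomes inaccessible to that client.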


4.3.2 Asynchronous writes 


NFS Version 3 asynchronous writes eliminate the syn- 
chronous write bottleneck in NFS Version 2. When a 
server receives an asynchronous WRITE request, it is 
permitted to reply to the client immediately. Later, the 
client sends a COMMIT request to verify that the data 
has reached stable storage; the server must not reply to 
the COMMIT until it safely stores the data. 


Asynchronous writes as defined in NFS Version 3 
are most effective for large files. A client can send 
many WRITE requests, and then send a single COMMIT 
to flush the entire file to disk when it closes the file. 
This allows the server to do a single large write, which 
most file systems handle much more efficiently than a 
series of small writes. For very large files, the server 
can flush data in the background so that most of it will 
already be on disk when the COMMIT request arrives.


Asynchronous writes are optional in NFS
Version 3, and specific client or server implementations
can choose not to support this feature. A server
can choose to flush asynchronous write requests to
stable storage immediately; in this case, the server
indicates this in the WRITE reply. Clients with
insufficient memory for the buffering required for server
crash recovery can always request synchronous writes.


4.3.2.1 Crash recovery 


The design of asynchronous writes is consistent with 
the stupid server and smart client philosophy of NFS. 
The client is required to keep a copy of all uncommit- 
ted data to support recovery following a server crash. 
The replies for WRITE and COMMIT requests include a 
write verifier that clients use to detect server crashes. 
The write verifier is an 8-byte value that the server 
must change whenever it crashes. Servers commonly 
use their boot time as a write verifier, because it is 
guaranteed to be unique after each crash. The client 
must save the write verifier returned by each
asynchronous WRITE request and compare it to the write
verifier returned by a later COMMIT request. If the write
verifiers do not match, then the client assumes that the
server has crashed and rebooted.


The client must then rewrite all uncommitted data.
Clients can push data with synchronous writes
following server failure. The client can delay rewriting
data when it detects a crash to avoid flooding a newly
rebooted server with WRITE requests. Figure 1 shows
the state changes that occur as a page of memory
containing file data is modified, written to the server, and
then committed.


Figure 1. Client page states with asynchronous writes (the Digital OSF/1 implementation). This diagram shows the state changes that
occur as a page of memory containing file data is modified, written, and then committed. ① A local application modifies the page, and it
is marked dirty. ② The client asynchronously writes the data to the server. The client stores the write verifier from the asynchronous write
request with each page. An explicit msync(2), fsync(2) or close(2) from the application, a file system sync, or page reclamation will trigger
a COMMIT. The write verifier returned from the COMMIT request is compared against those stored with the written pages. ③ The page’s
write verifier matches the returned verifier, and the commit succeeds. ④ The write verifier for the page does not match the returned write
verifier, triggering recovery. ⑤ The client synchronously writes the data to the server.
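

The verifier comparison and the rewrite of uncommitted data can be sketched with a toy client and server. This is an illustration under stated assumptions, not the Digital OSF/1 implementation: the classes, the offset-keyed dictionaries, and the boot counter standing in for boot time are all hypothetical.

```python
import itertools


class ToyServer:
    _boots = itertools.count(1)  # stands in for boot time

    def __init__(self):
        self.stable = {}  # data that has reached stable storage
        self._boot()

    def _boot(self):
        self.verifier = next(self._boots)  # changes after every crash
        self.unstable = {}  # asynchronous writes not yet committed

    def crash(self):
        self._boot()  # uncommitted data is lost

    def write(self, offset, data, sync=False):
        (self.stable if sync else self.unstable)[offset] = data
        return self.verifier

    def commit(self):
        self.stable.update(self.unstable)
        self.unstable.clear()
        return self.verifier


class ToyClient:
    def __init__(self, server):
        self.server = server
        self.uncommitted = {}  # offset -> (data, verifier from the WRITE reply)

    def write(self, offset, data):
        verifier = self.server.write(offset, data)
        self.uncommitted[offset] = (data, verifier)  # keep a copy until COMMIT

    def commit(self):
        current = self.server.commit()
        for offset, (data, seen) in self.uncommitted.items():
            if seen != current:
                # Verifier mismatch: the server rebooted after this WRITE,
                # so the data is pushed again with a synchronous write.
                self.server.write(offset, data, sync=True)
        self.uncommitted.clear()


server = ToyServer()
client = ToyClient(server)
client.write(0, b"first page")
server.crash()  # the first page never reached stable storage
client.write(8192, b"second page")
client.commit()
```

After the commit, both pages are on stable storage even though the server lost the first one in the crash. The client discards its copies only once the verifiers have been checked.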


4.3.2.2 Server details


An NFS Version 3 server makes the following three
guarantees:


* For a synchronous WRITE request, the server will
  commit to stable storage all data and modified
  metadata.

* The server will not discard uncommitted data
  without changing the write verifier.

* The server will commit the file’s data and modified
  metadata to stable storage for the range specified
  in the COMMIT request before reporting success.


Other conditions arise in which the write verifier 
must change. For example, the server must change the 
write verifier on failover if NFS Version 3 forms the 
basis of a non-shared memory, highly available
implementation of NFS [Bhide91]. The unsynchronized
data is not available to the backup processor, and there 
is no guarantee that the primary processor was able to 
flush uncommitted data to stable storage before going 
down. 


If a server is shut down cleanly, it could be advan- 
tageous to save the write verifier for reuse when the 
server is brought back on line. This avoids triggering 
client rewrites of already committed data. 
4.3.2.3 Data sharing


Asynchronous writes make write sharing without using
a higher-level application synchronization protocol
even less attractive than with NFS Version 2. NFS
Version 3 clients preserve close-to-open consistency:
clients typically block on a close(2) until all data is
flushed to server stable storage and revalidate cached
data with an attribute check on open(2). Strictly
speaking, close-to-open consistency is only an
implementation practice. Data sharing semantics of NFS
Version 3 differ from those of NFS Version 2 if an
NFS Version 3 server reboots and loses uncommitted
data. Because write sharing between NFS Version 2
clients was never supported in the absence of locking,
a change in essentially undefined behavior is not
considered a major issue.


4.3.3 READDIRPLUS 


NFS Version 3 contains a new operation called 
READDIRPLUS, which returns file handles and at- 
tributes in addition to the directory information re- 
turned by READDIR. 


READDIRPLUS exploits observed request sequences
generated by NFS Version 2 clients. For example,
when a UNIX user types “ls -F dir” to
browse a directory containing 20 entries, the ls
command opens the target directory, reads it, and then calls
stat(2) 20 times. In NFS Version 2, a READDIR request
would be followed by 20 sequential LOOKUP requests
to retrieve attributes (and file handles). In NFS
Version 3, a single READDIRPLUS retrieves the name list
and attributes for the 20 entries, significantly reducing
the command execution time.
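

The request counts in this example follow from simple arithmetic, sketched below. The functions are hypothetical and assume that the whole directory fits in a single reply and that the client starts with empty caches.

```python
def v2_listing_requests(entry_count: int) -> int:
    """NFS Version 2: one READDIR, then one LOOKUP per directory entry."""
    return 1 + entry_count


def v3_listing_requests(entry_count: int) -> int:
    """NFS Version 3: one READDIRPLUS returns names, handles and attributes."""
    return 1


saved = v2_listing_requests(20) - v3_listing_requests(20)
```

For the 20-entry directory above, the client issues 21 requests under Version 2 and one under Version 3.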


There are some drawbacks to READDIRPLUS,
however. A READDIRPLUS is more expensive than a 
corresponding READDIR. Results from an implementa- 
tion that generates exclusively READDIRPLUS requests 
show a performance drop because attributes for all en- 
tries in a directory are fetched repeatedly for every ac- 
cess to a directory. 


The READDIRPLUS operation can be viewed as a 
way to get the contents of a directory and to populate 
name and attribute caches for the entries in that direc- 
tory at the same time. The READDIRPLUS operation 
should be used only when reading a directory for the 
first time or when rereading a directory whose cache 
entry was invalidated because the directory was mod- 
ified. A READDIRPLUS should not be issued when a 
valid cache entry for a directory exists, because it is 
likely that a READDIRPLUS operation was recently is- 
sued to populate the various caches with directory en- 
try attributes and file handles. 


4.3.4 NFS3ERR_JUKEBOX 


NFS3ERR_JUKEBOX² lets servers inform clients that a
file is temporarily inaccessible (archived offline or 
locked against modification for backup) and that they 
should retry the request later. It is intended to improve 
the behavior of NFS in hierarchical storage manage- 
ment applications. 


In NFS Version 2, a server performs one of three 
actions if a file is temporarily inaccessible. The first 
action is to drop the request, which forces the client 
into normal back-off and retransmission. The request 
will be satisfied at some later time on a retry. The sec- 
ond action is to have the server block a service thread 
until the file again becomes accessible. The second ac- 
tion is often implemented inadvertently; because cli- 
ents employ mechanisms like biods to gain parallelism 
and will emit several related requests to one file, 
blocking server threads can hang the server. The third 
action is to return some error to the client, thus reject- 
ing the request. 


An NFS Version 3 server returns
NFS3ERR_JUKEBOX when a file is temporarily
inaccessible. The client operating system does not return
the error to the application but handles it internally by
aggressively delaying reissue of the request, thereby
reducing server load due to request retransmission.
After a tunable delay, the request is reissued. The client
should reissue the request with another transmission
id.


² The term “JUKEBOX” is a long-standing joke in the NFS community.
We kept the historical error name even though it incorrectly
implies a binding to a particular HSM mechanism. Given the generic
intent of the error, NFS3ERR_TMP_INACCESSIBLE would be
more appropriate.
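

The client-side handling of NFS3ERR_JUKEBOX can be sketched as a retry loop. This is an illustration only: the function, its parameters, the status strings, and the attempt limit are hypothetical, and `send` stands in for whatever issues one request and returns its status.

```python
import time


def call_with_jukebox_retry(send, first_xid=1, delay_seconds=1.0,
                            max_attempts=5, sleep=time.sleep):
    """Reissue a request while the server answers NFS3ERR_JUKEBOX.

    Returns (final_status, xid_of_last_attempt). Each reissue uses a new
    transmission id, and the error is never passed to the application
    unless the attempts run out.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        xid = first_xid + attempt  # another transmission id for each reissue
        status = send(xid)
        if status != "NFS3ERR_JUKEBOX":
            return status, xid
        if attempt + 1 < max_attempts:
            sleep(delay_seconds)  # the tunable delay before trying again
    return status, xid
```

Passing `sleep` as a parameter is a testing convenience, so the delay can be replaced by a stub. A real client would also decide what to report to the application if the file never becomes available, which the text does not specify.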


4.3.5 Weak cache consistency 


Many NFS Version 2 clients cache file and directory 
data to improve performance. To determine whether 
cached data is valid, a client sends a GETATTR request. 
If the new modification time from the server matches 
the modification time in the client’s cached attributes, 
then the client assumes its cache is up-to-date. If the 
modification times don’t match, then the file must 
have changed, and the client invalidates its cache. 


This method fails when the client itself modifies 
the file being cached. For example, if a client writes to 
one part of a file, cached data for other parts is proba- 
bly still valid. But it is impossible for the client to be 
sure, because the client’s own WRITE request updated 
the file’s modification time. A reckless client might 
keep the cache data (which is dangerous), and a cau- 
tious client might invalidate the cache (which is slow). 


Weak cache consistency offers an alternative by 
helping clients determine more accurately when to in- 
validate their cache. The reply for each NFS Version 3 
request that can modify data includes two versions of 
the file’s attributes: pre-operation attributes from just 
before the server performed the operation and post-op- 
eration attributes from just after the operation. If the 
modification time in the pre-operation attributes from 
the server matches the cached attributes on the client, 
then the client’s cache is valid. The client should up- 
date its attribute cache with the new post-operation at- 
tributes. 
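

The check described above can be sketched as follows. This is a minimal illustration that compares only modification times, as in the text; the class and function names are hypothetical, and treating missing pre-operation attributes as a reason to invalidate is an assumption modeled on the cautious client.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    mtime: int  # modification time in the client's cached attributes
    data_valid: bool = True


def apply_reply(entry: CacheEntry, pre_op_mtime: Optional[int],
                post_op_mtime: int) -> None:
    """Update a cache entry from the attributes in a modifying request's reply."""
    if pre_op_mtime is None or pre_op_mtime != entry.mtime:
        # Someone else changed the file before our operation (or the server
        # sent no pre-operation attributes): cached data cannot be trusted.
        entry.data_valid = False
    # In either case, remember the attributes from just after the operation.
    entry.mtime = post_op_mtime


undisturbed = CacheEntry(mtime=100)
apply_reply(undisturbed, pre_op_mtime=100, post_op_mtime=101)

raced = CacheEntry(mtime=100)
apply_reply(raced, pre_op_mtime=105, post_op_mtime=106)
```

In the first case the client keeps its cached data even though its own WRITE changed the modification time; in the second case another writer intervened, so the cached data is dropped.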


Weak cache consistency does not provide true 
consistency such as found in Sprite [Nelson88]. With
weak cache consistency, clients might see an inconsis- 
tent view of server data. For example, one client might 
have modified a file locally but not yet flushed the new 
data to the server. Even if it has, a second client will 
only verify modification times when a file is first 
opened or when the cached attributes time out. As a re- 
sult, a second client’s cache may be out of date. 


Some servers may be unable to generate pre-oper- 
ation attributes, so clients should be prepared to fall 
back to NFS Version 2 behavior. Since weak cache 
consistency is just a hint, client implementations are 
free to use it or ignore it. 


4.3.6 Other issues 


Two changes in NFS Version 3 impose extra work on 
the client. For many NFS Version 3 requests, it is op- 
tional to return file handle and attribute information 
that is mandatory in NFS Version 2. For example, in 
NFS Version 2, the CREATE request must return the
file handle and attributes for the newly created file, but
in NFS Version 3, their return is optional. As a result,
an NFS Version 3 client must be prepared to issue a
LOOKUP after each CREATE, in case the server does not
return a file handle for the new file. Furthermore, in
NFS Version 3, it is optional for LOOKUP to return
attributes, so the client must also be prepared to issue a
GETATTR.
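

The fallback sequence can be sketched as follows. The `server` object and its `create`, `lookup` and `getattr` methods are hypothetical stand-ins for the CREATE, LOOKUP and GETATTR procedures; the toy server deliberately omits every optional field to show the worst case.

```python
def create_with_fallback(server, directory, name):
    """Create a file and return (file_handle, attributes)."""
    handle, attrs = server.create(directory, name)
    if handle is None:
        # The server chose not to return a file handle: look it up.
        handle, attrs = server.lookup(directory, name)
    if attrs is None:
        # Attributes were omitted as well: fetch them explicitly.
        attrs = server.getattr(handle)
    return handle, attrs


class TerseServer:
    """Toy server that returns no optional information."""

    def __init__(self):
        self.calls = []

    def create(self, directory, name):
        self.calls.append("CREATE")
        return None, None

    def lookup(self, directory, name):
        self.calls.append("LOOKUP")
        return ("fh", directory, name), None

    def getattr(self, handle):
        self.calls.append("GETATTR")
        return {"size": 0}


terse = TerseServer()
handle, attrs = create_with_fallback(terse, "dir", "newfile")
```

Against a server that does return the optional fields, the same function issues only the CREATE.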


NFS Version 2 servers are required to accept all or 
none of the data in a WRITE request. In NFS Version 3, 
a server can accept only some of the data in a write, 
and the client is expected to send the rest a second 
time. For example, a client might send an 8192-byte
request, but a server might choose to accept only 1 byte.
The client must be prepared to send the remaining 
8191 bytes a second time, and again, the server might 
choose not to accept the entire request. 
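
A client-side loop that copes with partial acceptance might look like the following sketch. The `send_write` callable and the `make_stingy_server` helper are hypothetical; a real client would issue WRITE RPCs instead.

```python
def write_all(send_write, offset, data):
    """Send data, re-sending whatever the server did not accept.

    send_write(offset, data) returns the number of bytes accepted.
    Returns the number of WRITE requests issued.
    """
    requests = 0
    while data:
        accepted = send_write(offset, data)
        requests += 1
        if accepted <= 0:
            raise IOError("server accepted no data")
        offset += accepted
        data = data[accepted:]
    return requests


def make_stingy_server(limit):
    """Toy server accepting at most `limit` bytes per request."""
    store = bytearray()

    def send_write(offset, data):
        accepted = data[:limit]
        store[offset:offset + len(accepted)] = accepted
        return len(accepted)

    return store, send_write
```

With `limit` set to 1, an 8192-byte request degenerates into 8192 separate WRITE requests, which is the extreme case the text describes.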


In practice, these features are unlikely to be a 
problem because most server implementations will al- 
ways return optional information and accept the entire 
contents of WRITE requests. 


4.4. Changes to related protocols


NFS Version 3 continues the philosophy of building a 
network file service from a collection of cooperating 
protocols. The mount protocol (MOUNT) allows an 
NFS client to gain access to an exported directory on a 
server, and the network lock manager protocol (NLM) 
supports remote file locking for NFS. 


Changes to the file handle and file size fields in 
NFS Version 3 required corresponding changes in 
MOUNT and NLM, so new versions of both protocols 
have been released. The new MOUNT specification 
allows a successful mount to return a list of acceptable 
RPC authentication flavors (such as DES or Kerberos) 
for the client to use. Automounter facilities can use 
this information to correctly access servers which re- 
quire certain flavors of authentication. The new 
MOUNT protocol is also slightly cleaner than the pre- 
vious one. For example, legal error values have been 
enumerated instead of allowing any UNIX error num- 
ber. 


5. Performance 


A major goal of NFS Version 3 was to improve perfor- 
mance, especially in write throughput. Performance 
was improved by the following: 


* Providing reliable asynchronous writes

* Removing the 8KB data size limitation for READ
and WRITE requests

* Providing a READDIRPLUS procedure that returns
file handles and attributes with directory names

* Returning attribute information in all replies

* Providing weak cache consistency data to allow a
client to more effectively manage its caches


5.1. Test setup 


We measured Digital’s OSF/1 implementation of NFS
Versions 2 and 3. The local file system employed for
these tests was the Berkeley Fast File System with
enhanced clustering [McVoy91]. Except where noted,
the following configuration was used to generate the
performance results:


* Two Digital Model 3000/600 96MB workstations

* Private FDDI network

* Server running 16 nfsds (multiple threads of execution
used on an NFS server to gain parallelism)

* Client running 7 nfsiods (or biods; multiple
threads of execution used on an NFS client to gain
parallelism)

* With and without Prestoserve on server, using
1MB NVRAM

* With and without write gathering on server

* Server configured with one 1GB RZ26 SCSI disk,
2.3 MB/sec raw transfer rate


The tests ran with NFS running on top of UDP 
with a maximum transfer size of 8KB. The larger 
transfer sizes permitted by NFS Version 3 were not ex- 
ploited. Measurements at SunSoft on a system using 
larger than 8KB transfer sizes showed improved write 
throughput, presumably from the reduced file system 
overhead resulting from fewer separate I/O requests 
and fewer RPC messages over-the-wire. 


5.2. Sequential write throughput 


Figure 2 shows the results of writing a 10MB file over
a private FDDI network using NFS Version 2 and NFS 
Version 3 protocols and varying the server configura- 
tion to enable/disable Prestoserve acceleration and 
server write gathering. We consider the NFS 
Version 2, no write gathering, no Prestoserve configu- 
ration to be the average NFS write throughput avail- 
able today. We believe that the NFS Version 2, write 
gathering, Prestoserve configuration provides compet- 
itive NFS write throughput. We observe the following: 


* Both NFS Version 2 with Prestoserve and NFS
Version 3 deliver the maximum raw device rate
to the remote client.

* NFS Version 3 with asynchronous writes at 
2323 KB/s delivers only 1% less throughput than 
NFS Version 2 with Prestoserve and write gather- 
ing at 2346 KB/s, but it consumes 36% less server 
CPU. 

* At 2323 KB/s, NFS Version 3 is seven times faster
than NFS Version 2 at 320 KB/s for a typical
configuration with no write gathering and no
Prestoserve.


Figure 2. Comparisons of 10MB file writes over FDDI, Digital OSF/1, Digital 3000/600


The NFS Version 3 client emitted only asynchro- 
nous writes in these tests; therefore, server write gath- 
ering had no effect. This configuration is not shown. 
Prestoserve further improves NFS Version 3 asyn- 
chronous writes because there is a synchronous com- 
ponent to writing metadata during local file system 
clustering. NFS Version 2 with Prestoserve provides 
higher throughput on a single disk system than NFS 
Version 3, because Prestoserve masks the cost of re- 
duced cluster transfer sizes and missed rotations seen 
in its absence. Multiple spindles can help mask these 
effects in the absence of accelerator hardware. 


It was clear that the disk was the bottleneck for the 
above test, given the low CPU utilizations, available 
network bandwidth on FDDI (100 Mbit/s), and the raw 
speed of the disk. To remove the disk bottleneck, we 
made a second set of runs, sequentially writing a 
40MB file, with the following configuration changes: 


Server configured with four 2GB RZ28 SCSI 
disks, each 4.8 MB/sec raw transfer rate, four-way 
striped 

Client running 15 nfsiods (or biods) 

The results in Figure 3 show that for sufficiently
large files on a non-disk bound server, NFS Version 3
delivered 6105 KB/s, compared to an NFS Version 2
server with Prestoserve and write gathering that delivered
5022 KB/s. NFS Version 3 delivered 22% more
throughput at a similar server CPU utilization. The
maximum throughput of 6425 KB/s was achieved with
NFS Version 3 and Prestoserve. Throughput increased
with this configuration change, but not to the point of
the disk bandwidth limitation or CPU exhaustion. The
bottleneck moved to the network because of the limited
number of stations, limited application parallelism,
and FDDI token holding time characteristics of the
network interfaces.


Table 1: Connectathon Basic test suite results, 7 biods, single disk spindle (results in seconds, except as noted)

File and directory removal: remove 155 files, 62 directories, 5 levels deep


We conclude that asynchronous writes improve
both client throughput and server efficiency. They provide
most of the benefits associated with running an
NFS Version 2 server in “unsafe” mode, while ensuring
data reliability after server failure*. Prestoserve
should still accelerate small file writes, as well as other
modifying requests like CREATE and REMOVE.
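
The percentages quoted in this section follow directly from the measured throughputs. The short calculation below reproduces them; the helper names are ours, and only the KB/s figures come from the text.

```python
def percent_less(value, baseline):
    """How much smaller value is than baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


def percent_more(value, baseline):
    """How much larger value is than baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Single-disk runs (KB/s)
v3_async = 2323
v2_presto_gather = 2346
v2_plain = 320

single_disk_gap = percent_less(v3_async, v2_presto_gather)  # about 1%
speedup = v3_async / v2_plain                               # about 7x

# Four-way striped runs (KB/s)
striped_gain = percent_more(6105, 5022)                     # about 22%
```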


5.3. Connectathon test suite results 


Because the LADDIS benchmark generates NFS Version 2
RPC calls directly to measure server performance
[Wittle93], it cannot measure NFS Version 3
without modification. As an alternative, we ran the
Connectathon test suite, which was developed to test
the interoperability of NFS implementations. It runs
on the client on a remotely mounted directory and
exercises both client and server NFS code. It consists of
three passes that cover the basic functionality of a file
system. The Basic pass isolates specific features of the
client file system, and consists of ten separate tests.
Testing a single client file system feature typically
generates a mix of NFS requests. The General pass
runs multiple simultaneous large compiles, as well as
nroff(1). The Special pass exercises boundary cases in
NFS operations.


*[Nelson88b] suggests that unsafe writes would provide greater
throughput than asynchronous writes with close-to-open consistency.
That is, assuming that COMMIT blocks until all remaining data is
on disk when a file is closed, unsafe mode implementations which
do not block would clearly perform better. For large files, this effect
should be minimal.


Table 1 contains the results of running the Con- 
nectathon test suite. We conclude the following from 
these results: 


* Again, NFS Version 3 asynchronous writes are
clearly a win (see test 5a).


* Prestoserve remains useful on the server for other
metadata operations (CREATE, REMOVE, etc.), as
shown by tests 1, 2, 4, 6, 7 and 8. Test 6 performs
file deletions in addition to reading directory entries,
which explains the improvement with Prestoserve.


* NFS Version 3 reduces the total number of RPC
messages by 18% compared to NFS Version 2.
The reduction is due entirely to the increased frequency
of returned attributes and better cache
management through weak cache consistency data.
This reduction more than offsets the calls to the
new ACCESS and COMMIT RPC procedures.
NFS Version 2

calls
21865

null       getattr    setattr    root       lookup     readlink   read
0 0%       4058 18%   1168 5%    0 0%       6954 31%   250 1%     1779 8%

wrcache    write      create     remove     rename     link       symlink
0 0%       1881 8%    675 3%     1175 5%    352 1%     250 1%     250 1%

mkdir      rmdir      readdir    statfs
173 0%     173 0%     972 4%     1755 8%


NFS Version 3

calls
17764

null       getattr    setattr    lookup     access     readlink   read
0 0%       1282 7%    1168 6%    5499 30%   309 1%     250 1%     1731 9%

write      create     mkdir      symlink    mknod      remove     rmdir
1881 10%   675 3%     173 0%     250 1%     0 0%       1175 6%    173 0%

rename     link       readdir    readdir+   fsstat     fsinfo     pathconf   commit
352 1%     250 1%     758 4%     18 0%      1755 9%    0 0%       0 0%       65 0%


Figure 4. Detail of RPC counts for all three passes of the Connectathon Test Suite
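
As a cross-check on the 18% figure, the totals and the NFS Version 3 per-procedure counts from Figure 4 can be combined as below. The variable names are ours; the counts are those reported in the figure.

```python
v2_total = 21865
v3_total = 17764

# NFS Version 3 per-procedure call counts from Figure 4.
v3_calls = {
    "null": 0, "getattr": 1282, "setattr": 1168, "lookup": 5499,
    "access": 309, "readlink": 250, "read": 1731,
    "write": 1881, "create": 675, "mkdir": 173, "symlink": 250,
    "mknod": 0, "remove": 1175, "rmdir": 173,
    "rename": 352, "link": 250, "readdir": 758, "readdirplus": 18,
    "fsstat": 1755, "fsinfo": 0, "pathconf": 0, "commit": 65,
}

# Calls saved overall, as a percentage of the Version 2 total.
reduction_pct = 100.0 * (v2_total - v3_total) / v2_total

# Calls to procedures that did not exist in Version 2.
new_procedure_calls = v3_calls["access"] + v3_calls["commit"]
```

The 374 calls to ACCESS and COMMIT are small next to the 4101 calls saved overall, which is the sense in which the reduction "more than offsets" the new procedures.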





The read throughput results from test 5b reflect 
over-the-wire data transfers. Test 5b was modified to 
use the mmap(2) system call to invalidate the client’s 
data cache, forcing the requests to go over-the-wire. 
However, the data was cached on the server. The de- 
tailed RPC counts for the NFS Version 2 and Version 
3 results are shown in Figure 4. 


5.4. find(1) results 


The find(1) command was used to measure the effect 
of READDIRPLUS. find(1) scanned a remote file tree 
containing 9612 files distributed over 155 directories 
that were up to seven levels deep. The results are 
shown in Table 2. The over-the-wire byte counts in- 
clude all protocol headers. 


Using READDIRPLUS to fetch file handles and attributes
of entries in a directory reduces the find(1) execution
time by 36%, compared to NFS Version 2. Reduced
execution time can be attributed primarily to the
tenfold reduction in over-the-wire messages. The 155
GETATTR requests are generated to ensure close-to-open
consistency when opening a directory. Using the
READDIRPLUS procedure in NFS Version 3 reduced
the total bytes transferred over-the-wire by 43% and
the cumulative server CPU (percent utilization ×
elapsed time) by 46%, compared to using the READDIR
and LOOKUP procedures in NFS Version 2.
READDIRPLUS is clearly a win in this example.


The test was rerun with READDIRPLUS disabled in
NFS Version 3. The last column in Table 2 shows
these results. Disabling READDIRPLUS increases execution
time by 95%, compared to the NFS Version 3
result with READDIRPLUS enabled. More disturbing,
execution time increased by 24%, compared to the
NFS Version 2 results. We attribute this to the new
ACCESS procedure and to larger message sizes in NFS
Version 3, which increased the total bytes transferred
by 34% when compared to NFS Version 2. Message
sizes increased because new fields were added and old
fields were widened.


Table 2: find(1) results


This result illustrates a fundamental tradeoff in 
the NFS Version 3 design: increased RPC request and 
reply sizes are to be offset by new features in the pro- 
tocol. Naive implementations that fail to use the new 
features will perform worse for some benchmarks than 
NFS Version 2, but effective use of new features will
increase overall performance. 


6. Cost of porting 


The Digital OSF/1 implementation illustrates the ef- 
fort and cost to port the SunSoft NFS Version 3 refer- 
ence source into an existing Version 2 implementa- 
tion. The source code size of an implementation that 
supports both protocols is roughly 30,000 lines (C 
code + comments + white space). The Version 2 and 
Version 3 specific portions of the total are about 
12,000 lines each, with 6,000 lines of shared subrou- 
tines. Assuming engineers familiar with NFS Version 
2, the effort needed to produce an implementation that 
supports both versions of the NFS protocol for initial 
testing is the following: 


server                                    1 person-month
client (excluding asynchronous writes)    2 person-months
client asynchronous writes                1 person-month


Digital’s OSF/1 based kernel uses a unified page
cache managed by the virtual memory subsystem for
both program text and file data. This complicated the
client implementation of asynchronous writes because
of dependencies on data structures and interfaces in
the virtual memory system.


7. Related work 
“Look on my works, ye Mighty, and despair!” 
Ozymandias, Shelley, 1817 


The NFS Version 3 protocol mitigates the need for 
NFS-specific write gathering techniques on clients 
that support asynchronous writes, because a server can 
now simply process clusters of related asynchronous 
writes as part of its local buffered file system activity 
[McVoy91]. However, NFS-specific write-gathering 
on servers is still useful in supporting less-capable 
NFS Version 3 clients that do not support asynchro- 
nous writes or more-capable clients that resort to syn- 
chronous behavior during recovery. The stable storage 
semantics for metadata modifying operations, such as 
CREATE, remain unaffected by NFS Version 3. Thus, a 
server can still benefit from fast stable storage. To a 
lesser extent, fast stable storage techniques still
improve asynchronous WRITE performance, especially
for small files.


Adaptive retransmission strategies to improve the
behavior of NFS over UDP (as described in
[Nowicki89], derived from [Jacobson88]) and the use
of TCP to improve performance over wide area networks
[Macklem91] are applicable to NFS Version 3.
NFS Version 3 relaxes the 8KB limitation on the data
portion of a READ or WRITE request, permitting more
efficient use of TCP.


Three efforts to revise the NFS protocol are relat- 
ed to this work. The first is Spritely NFS, described in 
[Srinivasan89], [Mogul92], and [Mogul93]. Spritely 
NFS uses a stateful server that controls client caching 
behavior to ensure consistency. State recovery following
a crash is server-driven. The server keeps a nonvolatile
list of old clients that are contacted during a grace
period following reboot to initiate the rebuilding of 
state on the server. Spritely NFS employs consistency 
to address performance issues in NFS Version 2 by al- 
lowing clients to defer writes and by eliminating the 
need for clients to poll the server to detect file changes. 


The second effort is NQNFS [Macklem94], which
defines extensions to NFS Version 2 that are similar to 
those found in NFS Version 3. Size and offset fields 
were widened to 64 bits, and a READDIRPLUS proce- 
dure was added. Time-based leases provide a mecha- 
nism for data consistency and cache coherence among 
clients. Clients need to anticipate lease expiration. Cli- 
ents do not have special recovery code. Instead, leases 
are short enough to expire while the server is reboot- 
ing, forcing clients to request renewals (thereby driv- 
ing recovery) from the newly rebooted server. On re- 
boot, a server accepts only writes during a grace peri- 
od, after which it will grant new leases. 
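
The lease-driven recovery described above can be illustrated with a small model. This is a sketch and not the NQNFS protocol itself: the class names are hypothetical, and setting the grace period equal to one lease term is an assumption made so that every lease granted before the reboot has expired by the time new leases are issued.

```python
class LeaseServer:
    def __init__(self, lease_secs, boot_time=0.0):
        self.lease_secs = lease_secs
        # Assumption: grace period lasts one lease term after boot.
        self.grace_until = boot_time + lease_secs

    def request_lease(self, now):
        """Return the lease expiry time, or None during the grace period."""
        if now < self.grace_until:
            return None
        return now + self.lease_secs

    def write(self, now):
        """Writes are accepted even during the grace period."""
        return True


class LeaseClient:
    def __init__(self):
        self.expiry = None

    def cache_valid(self, now):
        return self.expiry is not None and now < self.expiry

    def renew(self, server, now):
        """Ask for a lease; keep the old expiry if none is granted."""
        granted = server.request_lease(now)
        if granted is not None:
            self.expiry = granted
        return self.cache_valid(now)
```

The client has no recovery path of its own here: an expired lease simply leads to another `renew` call, which succeeds once the server's grace period has passed.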


While the results of both NQNFS and Spritely 
NFS looked promising at the time we defined NFS 
Version 3, both were unfinished. We decided that add- 
ing consistency to NFS was contrary to our minimalist 
goals and best left for a subsequent revision. 


The third effort, [Fadden92] and [Glover92], de- 
scribed Trusted NFS (TNFS), which defines a method 
for handling ACLs and data labels that conserves 
space. Acknowledging that security data can be large, 
TNFS maps the data into opaque tokens and requires a
separate token mapping service to convert to and from 
a canonical over-the-wire format. We decided not to 
incorporate this work into NFS Version 3 because of 
instability in the POSIX ACL specification and the rel- 
ative immaturity of extant implementations of TNFS. 


DCE DFS [Kazar90] is related to NFS Version 3
only in that it describes an amount of effort that we
clearly did not want to undertake. Our primary goals
were to improve NFS Version 2 and deploy a new ver- 
sion quickly. We preferred to retain the ease of server 
crash recovery, at the expense of not supporting some 
of the more valuable features of DCE DFS. 


8. Future work 


The strategy for using READDIRPLUS needs further re- 
search. Reading the contents of a very large directory 
with READDIRPLUS can eject potentially more valu- 
able entries from client caches. Finding heuristics to 
guide choosing between READDIR and READDIRPLUS 
is hard because an NFS client cannot tell whether an 
application will need attribute information for a direc- 
tory’s children or not. More experience could lead to 
better heuristics than the simple ones used now. 


An NFS Version 3 client trying to do effective
cache management with weak cache consistency re- 
quires that the server guarantee atomicity of modifying 
operations and pre- and post-operation attribute gener- 
ation. The performance cost of supporting such atom- 
icity on the server is not fully understood, particularly 
for multiprocessor server implementations where ex- 
tensive locking could result in unwanted serialization. 
More analysis is needed. Weak cache consistency with 
the WRITE procedure provides no useful sharing se- 
mantic. 


Additional characterization and tuning of NFS 
Version 3 under more complex workloads is needed. 
An NFS Version 3 LADDIS benchmark is needed. 
Tuning NFS Version 3 implementations should not
pose insurmountable problems. 


We did not expect the NFS Version 3 specifica- 
tion to be perfect. Our hope is that the protocol speci- 
fication will grow to reflect common practice and pro- 
vide guidelines on conforming behavior. The develop- 
ment of an NFS Version 3 Validation Suite by SunSoft 
will aid interoperability. Finally, interoperability test- 
ing of implementations at Connectathon remains the 
cornerstone of successful file sharing with NFS.


8.1. NFS Version 4 


In defining NFS Version 3, we assumed that other pro- 
tocol revisions would follow, allowing us to defer fea- 
tures. Improved data and cache consistency is an obvi- 
ous candidate for NFS Version 4. POSIX write-shar- 
ing semantics exist today on a single NFS client. NFS 
Versions 2 and 3 support a client-driven bounded 
time-based model for write sharing [Kazar88], with
close-to-open consistency. This model does not provide
sufficient guarantees for concurrent write-sharing
between cooperating clients in the absence of explicit
locking. The fact that write-sharing is infrequent even
in those distributed file systems that support it
[Welch90] is a reason NFS has been successful despite
this limitation. Both Spritely NFS and NQNFS demonstrate
how to provide stronger consistency guarantees
with a provision for server and client crash recovery.
Both approaches depend on the clients to re-establish
state after server reboots.


Disconnected operation of fixed and nomadic cli- 
ents is a potential area for future work. More investi- 
gation is required on how consistency guarantees 
work, if at all, in the presence of clients disconnected 
longer than the lease terms or callback timeouts used 
by NQNFS or Spritely NFS, respectively. 


Stronger security models in NFS are another area 
for future work. More research is needed on whether to 
pursue trusted system support in general. 


The problems of consistent name space construc- 
tion and increased availability are areas of research for 
future protocol revisions and are perhaps best solved 
with innovative implementations using existing proto- 
cols. 


9. Conclusions 


The constrained NFS Version 3 effort addressed the 
following concerns with NFS Version 2: 


* 64-bit file sizes are now supported.

* Asynchronous writes increased throughput sevenfold
over unaccelerated NFS Version 2 implementations.

* Over-the-wire traffic measured both by RPC
counts and network loading has been reduced.

* Directory browsing is faster, with less network
loading and lower CPU utilization.

* Performance improvements were achieved despite
the size increase of the file attribute structures
resulting from 64-bit file size support.

* Many “minor annoyances” of the NFS Version 2
protocol have been corrected.


NFS Version 3 was specified, reviewed, prototyped,
verified, and supplied by multiple vendors for
external testing in less than 24 months from the initial 
Boston meetings. At Connectathon in 1993, prototype 
implementations interoperated successfully. We 
achieved the goal of providing measurable improve- 
ments over NFS Version 2 with little effort required to 
create an implementation. 


There is more work to be done. NFS Version 3 of- 
fers the potential for better name and attribute cache 
management than is possible with NFS Version 2. Re- 
alization of this potential is a current and future effort. 
9.1. Availability


The NFS Version 3 protocol specification draft can be
obtained from bcm.tmc.edu, gatekeeper.dec.com
and ftp.uu.net using anonymous FTP.


NFS Version 3 will be available in the next major
release of Digital’s OSF/1. Servers will fully support 
NFS Version 3, as well as provide NFS Version 2 for 
interoperability with older clients. At SunSoft, a So- 
laris 2 implementation of NFS Version 3 that supports 
TCP and large transfer sizes is in early deployment and 
will shortly go to external field test. In addition, a ref- 
erence implementation of NFS Version 3 with TCP 
support is undergoing final testing. Early access to the 
reference implementation from SunSoft will occur this 
summer. Other implementations are in progress. Con- 
tact your vendor for further information. 


SunSoft is developing an NFS Version 3 Protocol 
Validation Suite to provide a tool to help ensure in- 
teroperability of clients and servers. This validation 
suite will be made available for licensing. 
Clue Tables: A Distributed, Dynamic-Binding
Naming Mechanism*


Cheng-Zen Yang, Chih-Chung Chen, and Yen-Jen Oyang 


Department of Computer Science 
and Information Engineering 
National Taiwan University 
Taipei, Taiwan, R.O.C. 


Abstract 


This paper presents a distributed, dynamic nam- 
ing mechanism called clue tables for building highly 
scalable, highly available distributed file systems. The 
clue tables naming mechanism is distinctive in three 
aspects. First, it is designed to cope well with the hier- 
archical structure of the modern large-scale computer 
networks. Second, it implicitly carries out load balanc- 
ing among servers to improve system scalability. Third, 
it supports file replication and dynamically designates 
a primary copy to resolve possible data inconsistency. 
This paper also reports a performance evaluation of the 
clue tables mechanism when compared with NFS, a 
popular distributed file system. 


1 Introduction 


Distributed file systems are the backbone of the modern
network computing environment. The naming mechanism
in a distributed file system maps the logical name
of each individual file to its physical location. In the
design of a modern distributed file system, availability
and scalability are two essential concerns [1, 2].
In order to build a highly available, highly scalable
distributed file system, a designer must incorporate a
naming mechanism that can cope with these concerns.


To meet the demands, a naming mechanism must
be distributed in nature, and support file replication and
dynamic binding. The naming mechanism must be distributed
in nature because centralized naming mechanisms
suffer low scalability due to limited capacity of
the central naming server. The naming mechanism
must support file replication because file replication
improves both availability and scalability of the system.
The presence of replicated file copies prevents
service disruption due to failure of a single file server
and thus improves system availability. With file replication,
the scalability of the system is improved because
clients can access replicated file copies on different
servers to avoid congestion of a particular file
server. To maximize the benefits of file replication,
the naming mechanism must support dynamic binding.
With dynamic binding, the clients that initially turn to
a crashed server for file service can establish new connections
on-the-fly to other servers that have replicated
copies of the files. Also, clients can dynamically select
a server for binding to achieve a good load balancing
among servers.


*This research was sponsored by National Science Council of
R.O.C. under grant NSC 83-0408-E-002-002


In this paper, we propose a new distributed, 
dynamic-binding naming mechanism called clue ta- 
bles. The clue tables mechanism offers the basis to 
build a highly available, highly scalable distributed file 
system and is distinctive in three aspects: 


1. It is designed to cope with the hierarchical structure
of modern computer networks:
In a modern computer network, particularly a
large-scale computer network, bridges are commonly
installed to partition the network into a
number of clusters. For example, the network in
a research institute may be partitioned so that the
computers in each laboratory form a local cluster.
The hierarchy of network partitions may extend
over several levels. The major distinction of the
clue tables mechanism is that it was designed
to cope with the hierarchical network structure.
With clue tables, we can make clients turn first
to local file servers to locate a file. If a client
cannot find the file on local servers, or the local
servers that store the file are unavailable, e.g.
crashed, then the client will automatically go one
level up in the network hierarchy to locate the
file on remote servers. The main reason behind
adopting this practice is to make most bindings
among clients and servers occur in conformity
with network hierarchy design.


In reality, this is the main distinction between the
clue tables mechanism and the similar prefix tables
mechanism [3, 4]. With prefix tables, a client
that is bound to a remote server for file service
due to failure of local servers will not automatically
switch back to local servers upon successful
recovery of local servers. As a result, a client may
still rely heavily on a remote server for file service
for a long period of time after a crashed local
server is back to work again. On the other hand,
with clue tables, a client will always turn to local
servers first to locate a file when it starts a new file
session, i.e. opens a file. This guarantees that the
bindings among clients and servers occur in conformity
with network hierarchy design whenever
possible.


2. It implicitly carries out load balancing among
servers to improve system scalability:
The clue tables mechanism implements dynamic
binding between clients and servers upon file
open to achieve load balancing among servers.
This is the main distinction of the clue tables
mechanism when compared with other distributed
file systems such as AFS[1, 5], Coda[6],
Locus[7], V kernel[8], Amoeba[9], Ficus[10],
and Deceit[11] that also implement dynamic
binding. With clue tables, a client, upon opening
a file, multicasts access requests to servers
that have a replica of the file. If more than one
server acknowledges the request, the client always
chooses the server that responds fastest for
binding. With this practice, the clue tables mechanism
implicitly carries out load balancing among
servers since a server with a lighter load has a better
chance to respond faster than a server with a
heavier load. Though it is not guaranteed that the
server with the lightest load among multiple candidates
always responds fastest, a near-optimum
load balancing situation should persist most of
the time.


3. It supports file replication and dynamically designates
a primary file copy to resolve possible data
inconsistency:
One crucial issue with file replication is how to
maintain data consistency. The clue tables mechanism
resolves this issue by dynamically designating
a primary file copy. It is the dynamic
nature and granularity of the mechanism that distinguishes
the clue tables mechanism from other
distributed file systems that also employ a primary
copy based approach, e.g. Locus[7]. The
main reason behind employing the dynamic approach
with file-level granularity is to distribute
servers’ load. With clue tables, when one or more
clients attempt to write to a file, one server is dynamically
designated as the primary server and
all accesses to the file are temporarily forwarded
to the primary server. Once the situation that
could cause data inconsistency no longer exists,
the primary server broadcasts the new version of
the file to other servers.
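
The fastest-responder rule in the second item can be sketched as follows. The function is an illustration only: it takes already-measured reply latencies as input instead of performing a real multicast, and the latency values in the usage note are made up.

```python
def choose_server(responses):
    """Pick the server whose acknowledgment arrived first.

    responses maps a server name to its reply latency in seconds,
    or to None if no acknowledgment arrived.
    Returns the chosen server name, or None if nobody answered.
    """
    answered = {s: t for s, t in responses.items() if t is not None}
    if not answered:
        return None
    return min(answered, key=answered.get)
```

For example, with hypothetical latencies `{"solar": 0.012, "earth": 0.004, "global": None}` the client binds to `earth`, on the expectation that a quick reply indicates a lightly loaded server.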


In the following part of the paper, section 2 de- 
scribes the basic structure of clue tables. Section 3 
elaborates on the system operations with clue tables. 
Section 4 addresses the implementation and perfor- 
mance issues. Section 5 concludes this paper. 


2 Basic Structure of Clue Tables 


The clue tables mechanism implements a global nam- 
ing space. The primitive entities in the clue tables 
mechanism are file collections termed domains. A 
domain is a subtree in the integrated file system of a 
distributed system. A file server can contain one or 
more domains while a domain cannot spread over mul- 
tiple file servers. A domain must be entirely stored 
on one file server. However, we may have replicated 
copies of a domain stored in a number of servers. Fig. 1 
illustrates the naming architecture with clue tables. 


Each file in a domain is an object composed of 
two attributes: logical file name and physical location. 
Each file is uniquely identified by its logical file name 
in the integrated file system. The physical location 
attribute specifies where the file resides. Clue tables 
are the directories that the clients refer to for locating 
a file based on its logical file name. Fig. 2 shows an 
example of clue tables. A clue table contains a number 
of entries, each of which corresponds to a domain in 
the integrated distributed file system. 


For example, in Fig. 2, there are two entries. The 
first entry corresponds to the domain with root “/usr” 
and the second entry corresponds to the domain with
root “/usr/bin”. A clue table entry specifies the file 
servers that have a copy of the domain. For example, 
the domain with root “/usr” has three replicated copies 
on file servers solar, earth, and global. 


Figure 1: The naming architecture. Domain X is stored on servers A and B. Domain Y is stored on server C. Domain
Z is stored on server B.


With file replication, the availability and scalability
of the system are significantly enhanced. The
multiple servers listed in a clue table entry are grouped
and prioritized. For example, in the first entry of Fig. 2,
servers solar and earth form the first group, which is
separated from the second group by a semicolon. The
second group contains only one server, global. When
trying to locate a file, a client first turns to the servers
in the first group. If all the servers in this group are unavailable,
e.g. crashed or unreachable due to network
failure, the client then turns to the second group and
so on. The motivation to group and prioritize servers
in the list is to make the bindings among clients and
servers occur in conformity with network hierarchy
design whenever possible.


A clue table entry is overridden by another entry
when the root of the second entry is a subdirectory of
the first entry’s domain. For example, in Fig. 2, the second
entry, corresponding to the domain rooted at “/usr/bin”,
overrides the first entry. When a client tries to locate a
file, it first searches the clue table for the domain whose
root is the longest prefix of the filename. The client
then sends requests to the servers listed in that entry.
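The longest-prefix search can be sketched as follows. This is only an illustration: the dictionary layout, CLUE_TABLE, is_under and find_domain are hypothetical names, not part of the clue tables implementation, and the table contents are taken from Fig. 2.

```python
# Hypothetical in-memory form of the clue table in Fig. 2.
# Each domain root maps to its server groups, highest priority first.
CLUE_TABLE = {
    "/usr": [["solar", "earth"], ["global"]],
    "/usr/bin": [["earth", "solar", "csm"]],
}


def is_under(root, path):
    """True if path equals root or lies in the subtree below root."""
    return path == root or path.startswith(root.rstrip("/") + "/")


def find_domain(table, path):
    """Return the longest domain root that contains path, or None."""
    matches = [root for root in table if is_under(root, path)]
    return max(matches, key=len) if matches else None
```

Matching on whole components (note the trailing "/" in is_under) keeps a name such as “/usr/binary” in the “/usr” domain rather than in “/usr/bin”.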


A clue aliasing mechanism is incorporated to provide
more flexibility in system integration. The second
entry in Fig. 2 shows an example of clue aliasing. When
the client accesses directory “/usr/bin” and sends a
request to server csm (see Fig. 3), it is actually accessing
directory “/usr/sparc/bin” on csm. Through clue aliasing,
a directory on a server can appear under different
names on different clients.
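As a rough sketch of clue aliasing, the translation from the client's logical name to the name used on a particular server might look like this. ALIASES and server_path are hypothetical names; the single alias shown is the one from Fig. 2.

```python
# Hypothetical alias map: (domain root, server) -> directory on that server.
ALIASES = {("/usr/bin", "csm"): "/usr/sparc/bin"}


def server_path(domain_root, server, logical_path, aliases=ALIASES):
    """Map a client's logical path to the path used on the given server."""
    # Without an alias the server stores the domain under the same name.
    target_root = aliases.get((domain_root, server), domain_root)
    return target_root + logical_path[len(domain_root):]
```

With this table, a request for “/usr/bin/foo” sent to csm is rewritten to “/usr/sparc/bin/foo”, while the same request sent to earth keeps its logical name.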


Certain rules apply to the creation of clue tables 
in a distributed file system: 


1. Each node, a client or server, in a distributed 
system should have a clue table. A clue table can 
be shared by two or more nodes but each node 
must have access to one clue table. The clue 
table of a node can be stored in a node’s local 
disk if it has one. If the node is diskless, then 
the clue table is stored in a remote server and 
is cached by the node in the memory while it is 
operating. 

2. Each node may have some private entries in its
own clue table. One good use of this flexibility is
the creation of a private “/tmp” directory. If a
client has local disks, it may be more appropriate,
for performance reasons, to place temporary
files created by this client on its local disks rather
than on remote servers. Even if the client does
not have a local disk, it may share a localized
“/tmp” directory with other diskless clients in the
local cluster.


3. If a domain is shared by multiple nodes, all the
nodes must contain the same set of servers in their
clue table entries corresponding to this domain.
The grouping and prioritizing of these servers
may be different, reflecting each node’s physical
position in the network hierarchy. However,
the set of servers in the entries must be identical.
Otherwise, data consistency among replicas
of the domain cannot be maintained. (A detailed
discussion of data consistency guarantees is
presented in the next section.)


4. When creating a clue table for a node, we should
group and prioritize the servers according to their
proximity to the node in the network hierarchy.
By doing so, the node will always find a file on the
nearest available server and its interference with
remote nodes in the network will be minimized.


/usr      :rep=3:               # three replicated copies.
          solar,                # server name list.
          earth;
          global.               # second group.

/usr/bin  :rep=3:               # three replicated copies.
          earth,
          solar,
          csm=/usr/sparc/bin.   # clue aliasing.

Figure 2: An example of clue tables and clue aliasing.


3 System Operations with Clue Tables 


This section discusses how the system operates with
clue tables. Since the clue tables mechanism implements
dynamic binding, the bindings among clients
and servers can change on the fly. A client establishes
a binding to a server at the start of a file session. A
file session is a series of operations on a file enclosed
by open and close operations. When a client starts a file
session, it first searches the clue table for the domain
that includes the file and sends requests to the servers
according to the grouping and priority specified in the
clue table. The servers that receive the request respond
by locating the file in their own storage and returning
a success or failure message. If replicated copies exist
on several servers, the client chooses the server
that responds most quickly with a success message to
serve this file session. Fig. 3 illustrates the access
request flow.
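The selection rule can be sketched as below. This is a simplified model, not the client daemon's code: choose_server and the probe callable are hypothetical, and servers are probed one after another here, whereas the paper's implementation multicasts the request to a group.

```python
def choose_server(groups, probe):
    """Pick a server for one file session.

    groups: list of server-name lists, highest priority first.
    probe:  hypothetical callable returning the reply time in seconds for a
            success message, or None if the server failed or is unreachable.
    """
    for group in groups:
        replies = {}
        for server in group:
            elapsed = probe(server)
            if elapsed is not None:
                replies[server] = elapsed
        if replies:
            # Bind to the quickest responder within this priority group.
            return min(replies, key=replies.get)
    return None  # no replica is currently reachable
```

A lower-priority group is consulted only when every server in the groups before it fails to answer, which is what keeps bindings inside the local part of the network hierarchy whenever possible.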


As mentioned earlier, by implementing dynamic
binding upon file opening, the clue tables mechanism
implicitly carries out load balancing among
servers. A server with a lighter
load is likely to respond faster than a server
with a heavier load. Although it is not guaranteed that
the server with the lightest load among multiple candidates
always responds fastest, a near-optimum load
balance should persist most of the time.


When a client selects a particular server for a file
session, the client caches the attribute block of the file,
the server identification, and a file handle returned by
the server to speed up subsequent file operations. The
file handle is a unique file index assigned by the server
for quickly identifying and locating an opened file.
Note that the bindings among clients and servers are
made on a per-file-session basis. A client may turn to different
servers for different file sessions. Due to
network proximity and server load, two clients that are
concurrently accessing the same file may be served
by two different servers. As a result, access load is
distributed over the servers and the scalability of the system
is significantly improved.


The bindings among clients and servers may
change during a file session. One reason is to avoid
access disruption caused by server or network failure.
When such a failure occurs, the client searches the
clue table for another server that has a replicated copy
of the file. If this search succeeds, the client establishes
a binding with the second server, and the user
observes no service disruption other than longer latency
for some file operations.


Another occasion in which rebinding is invoked 
is to maintain data consistency among replicated file
copies. With file replication, the system must be able 
to resolve potential data inconsistency when concurrent 
writing or read-write sharing occurs. 


Figure 3: Flow of access requests. In this case, server csm is the fastest to reply with a success message. Client dfs1 will
establish a binding to csm’s /usr/sparc/bin/foo.


We address the data consistency issue with a
preventive mechanism based on a primary copy.
When one or more clients open a file for writing, a
server that holds a replica of the file is designated as
the primary server. All other servers that also have
a replica of the file invalidate their copies. Meanwhile,
clients that were initially bound to
servers whose copies have been temporarily invalidated
carry out rebinding operations to connect to the primary
server. The primary server provides all access
to the file as long as the writing operations
continue. When the writing ends, the primary
server broadcasts the new version of the file
to the other servers.


An interesting issue here is how the primary 
server is selected. As mentioned earlier, the clue tables 
mechanism dynamically designates the primary server 
to distribute servers’ load. In the situation that only one 
client attempts to write to the file, the server that re- 
ceives the write request will become the primary server. 
If two or more clients want to open the file for writing 
at the same time, all the servers that receive a write 
request will compete to become the primary server. A 
simple arbitration mechanism based on a pre-assigned 
priority is used to determine the primary server. 
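The paper does not spell out the arbitration rule beyond "pre-assigned priority", so the following is only a sketch under the assumption that each server carries a fixed numeric rank and the lowest rank wins. elect_primary, open_for_write and the priority table are hypothetical.

```python
def elect_primary(contenders, priority):
    """Choose the primary among servers that received a write-open request.

    priority: hypothetical pre-assigned rank per server; lower rank wins.
    """
    return min(contenders, key=lambda server: priority[server])


def open_for_write(replicas, contenders, priority):
    """Return (primary, servers whose copies are invalidated)."""
    primary = elect_primary(contenders, priority)
    invalidated = [server for server in replicas if server != primary]
    return primary, invalidated
```

With a single writer the only contender becomes the primary, matching the one-client case described above; clients bound to an invalidated replica would then rebind to the primary.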


4 Implementation and Performance Evaluation


Fig. 4 shows the structure of an implementation of
the clue tables mechanism. This implementation is
based on the Mach 2.6 operating system [12] and the VFS
(Virtual File System) [13, 14]. On the client side,
the VFS forwards file accesses
that invoke the clue tables mechanism to the CluFS
interface. The CluFS interface checks whether the file
access hits the local disk/file cache. If not, the file access
request is forwarded to a user-level process called
the client daemon. The client daemon looks up the local
clue table and multicasts the request to the servers
according to the grouping and priority in the matched
clue table entry. On the server side, incoming requests
are processed by a user-level process called the
server daemon. The server daemon interfaces with the
VFS to locate the file and returns a success or failure
message to the requesting client.


To study the performance of the clue tables
mechanism, we conducted an experiment and
compared the results with NFS [15]. The experimental
system consists of five Intel 80486-based personal
computers connected by an Ethernet network. All machines
run the Mach 2.6 operating system; three of
the five act as servers while the remaining two act as
clients.


Fig. 5 shows the results of the experiment. In
the experiment, we repeatedly open and close 30 files to
test the overhead of dynamic binding operations. The
horizontal axis gives the number of times the operations
are repeated. The vertical axis gives the time the operations
take in seconds. Fig. 5 shows that the clue tables mechanism
performs slightly better than NFS in file opens.
This is due to the use of the socket mechanism [16] in
the clue tables file system. The socket mechanism incurs
less overhead than the RPC (Remote Procedure
Call) mechanism used in NFS. For file closes, NFS
takes virtually no time. The reason is that NFS uses a
stateless file cache coherence protocol and, as a result,
does not invoke a remote procedure call when a file is
closed.


Figure 4: An implementation of the clue tables mechanism.


5 Conclusion 


In this paper, we presented a distributed, dynamic nam- 
ing mechanism called clue tables for building highly 
scalable, highly available distributed file systems. The 
clue tables naming mechanism is distinctive in three 
aspects. First, it is designed to cope well with the 
hierarchical structure of modern large-scale computer
networks. Second, it implicitly carries out load balanc- 
ing among servers. Third, it supports file replication 
and dynamically designates a primary copy to resolve 
possible data inconsistency caused by concurrent ac- 
cesses to multiple file replicas. The clue tables naming 
mechanism is incorporated in the Azalea distributed file 
system currently being developed at National Taiwan 
University. 


Optimistic Lookup of Whole NFS Paths in a Single Operation


Abstract


VFS lookup code examines and translates path 
names one component at a time, checking for spe- 
cial cases such as mount points and symlinks. 
VFS calls the NFS lookup operation as necessary. 
NFS employs caching to reduce the number of 
lookup operations that go to the server. How- 
ever, when part or all of a path is not cached, 
NFS lookup operations go back to the server. Al- 
though NFS’s caching is effective, component-by- 
component translation of an uncached path is in- 
efficient, enough so that lookup is typically the op- 
eration most commonly processed by servers. We 
study the effect of augmenting the VFS lookup 
algorithm and the NFS protocol so that a client 
can ask a server to translate an entire path in a 
single operation. The preconditions for a success- 
ful request are usually but not always satisfied, so 
the algorithm is optimistic. This small change can 
deliver substantial improvements in client latency 
and server load. 


1 Introduction 


The NFS lookup operation frequently goes “over
the wire” from client to server. For example,
on the main file servers of Columbia’s Computer
Science department, lookups constitute approximately
31% of all NFS operations serviced. This
makes lookup the most common operation in
our environment, followed closely by null and
getattr, and then by read.¹ Similar results are
typical at other installations; lookup is the most
common, or at least one of the two or three most
common, NFS operation to reach the server.²


¹ This data was gathered by the nfsstat utility on eight
file servers, all running SunOS version 4.1.3 and NFS version 2.
The total number of NFS operations, including null,
was nearly 4 million. The frequency of the other common
operations was 27.5%, 22.7%, and 4.8% for null, getattr,
and read, respectively. Across the eight servers there was
considerable variance among the relative frequencies, but,
in every case, lookup, getattr, and null were by far the
most common operations.


These results are obtained despite the presence 
of a cache (called “directory-name lookup cache,” 
or DNLC) that is quite effective in mapping a 
(directory-vnode, name-within-directory) tuple to
the vnode for the name. Measuring in the same 
environment, we found an average DNLC hit rate 
of 75% for name lookups in NFS file systems. 
Apparently, the NFS client side calls lookup so 
often that a quarter of the calls (namely, the 
DNLC misses) are sufficient, by themselves, to 
make lookup the most common operation at the 
server. 


The seemingly high number of over-the-wire 
lookups has led us to wonder if they are all nec- 
essary and whether some steps might be taken to 
reduce their number. The most obvious approach 
is to increase the effectiveness of DNLC. We did 
this, in two ways: 


1. The DNLC implementation of SunOS 4.1.3
will not cache a (directory-vnode, name-within-directory)
tuple if the name is more
than 15 characters long. Preliminary measurements
indicated that a non-negligible
fraction (12%) of DNLC misses were caused
by component³ names being longer than 15
characters. Accordingly, we increased the
maximum name size to 31 characters.


This change reduced to zero the number of
DNLC misses due to over-long component
names. However, the effect on the number
of NFS lookups was negligible (a fraction
of a percent). Investigation revealed
that over-long names occurred predominantly
in the local UNIX file system (UFS) rather
than in remote NFS file systems.⁴ This finding
is obviously site-dependent and workload-dependent,
so it might still be worthwhile to
raise the 15-character limit, though perhaps
to some number smaller than 31.


2. In the typical configuration of SunOS 4.1.3,
the size of DNLC is set according to the formula
“17*MAXUSERS + 90.” MAXUSERS
is set to 48, leading to 906 cache entries.
We doubled this number, with the result of
increasing the hit rate for NFS lookups by
about half a percent.


² For example, the most common operations in the nhfsstone
benchmark are, in order: lookup (34%), read (22%),
write (15%), and getattr (13%).

³ Following convention, we call the argument to VFS
lookup a path, or pathname. A path consists of a sequence
of components. The process of mapping a path or a component
to a vnode we call translation or resolution.


The two changes together resulted in increasing 
the DNLC hit rate for NFS lookups by less than 
one percent. We conclude that most lookup op- 
erations that go to the server are for pathnames 
that have not been looked up before, or else were 
looked up in the “distant past.” 
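The DNLC properties discussed above (the tuple key, the name-length limit, and the sizing formula) can be modelled in a few lines. This is a toy model rather than the SunOS code: NameCache and dnlc_size are hypothetical names, and LRU replacement is assumed, in line with the paper's later remark that the cache is managed LRU.

```python
from collections import OrderedDict


def dnlc_size(maxusers):
    """Cache size formula quoted above for SunOS 4.1.3."""
    return 17 * maxusers + 90


class NameCache:
    """Toy (directory-vnode, name) -> vnode cache with LRU replacement."""

    def __init__(self, capacity, max_name_len=15):
        self.capacity = capacity
        self.max_name_len = max_name_len
        self.entries = OrderedDict()

    def enter(self, dir_vnode, name, vnode):
        if len(name) > self.max_name_len:
            return False  # over-long names are never cached
        self.entries[(dir_vnode, name)] = vnode
        self.entries.move_to_end((dir_vnode, name))
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return True

    def lookup(self, dir_vnode, name):
        key = (dir_vnode, name)
        if key not in self.entries:
            return None  # a miss: the lookup goes to the file system
        self.entries.move_to_end(key)
        return self.entries[key]
```

With MAXUSERS set to 48, dnlc_size gives the 906 entries mentioned in the text; raising max_name_len to 31 corresponds to the first change described above.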


So the simple approach of increasing DNLC size 
will, by itself, not substantially reduce lookup 
traffic to the server. This should not be surprising, 
since DNLC has been available for many years, 
and its performance has presumably been tuned 
with some care. At least in our environment, it 
seems that the size of DNLC has already been set
beyond the point of diminishing returns.


To substantially reduce lookup traffic to the 
server requires a more efficient method for looking 
up “new” pathnames. In the next two sections we 
describe and evaluate such a method. 


2 Path Lookup Algorithm 


Roughly speaking, the existing lookup algorithm 
used at the VFS level is: 


dir = vnode for start of path;
for (;;) {
    component = next_component(path);
    if (component is "..") {
        if (goes beyond process’ root)
            return error;
        while (dir is a mount point) {
            dir = cross back over mount point;
            if (goes beyond process’ root)
                return error;
        }
    }
    vnode = VOP_LOOKUP(dir, component);
    if (reached end of path)
        return vnode;
    while (vnode is mounted on)
        vnode = root of overlaid f/s;
    if (vnode is symlink)
        prepend symlink to remaining path;
    else
        dir = vnode;
}


⁴ The DNLC module is defined at the VFS level, and
is callable by any underlying file system, such as NFS or
UFS.


The VOP_LOOKUP macro expands to call the lookup
operation of the right type of underlying file system
(e.g., NFS, HSFS,⁵ etc.). That operation may
use DNLC to reduce the number of lookup calls
that go to the server; for example, both NFS and
UFS do this.


This algorithm translates component names 
into vnodes one-by-one, testing for three major 
special cases at each iteration: 


1. the vnode is a symlink
2. the vnode is mounted-on
3. the component is “..”


These special cases form the main reason why 
lookup happens one component at a time. Sym- 
links are hardest to handle, since they are a source 
of uncertainty. That is, a component cannot be 
known to be a symlink until the server indicates 
that it is, and expansion of the symlink can change 
the path arbitrarily. In particular, the unpre- 
dictability of the content of symlinks means that 
not all mount points are evident in a pathname 
when lookup begins. Crossing a mount point is 
a major operation, as it potentially changes the 
server to which lookup operations should be di- 
rected. Finally, references to the parent directory 
(i.e., “..” or “dot-dot”) might also lead to crossing 
a mount point (in the “up” direction, as opposed 
to the “down” direction of the previous case). 
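To make the cost of component-at-a-time translation concrete, here is a small, simplified walk over a toy namespace. It is not the VFS code: walk and the tree layout are hypothetical, mount points and “..” are left out, and (unlike the pseudocode above) a symlink in the final position is also followed.

```python
def walk(tree, path, max_links=8):
    """Resolve path one component at a time over a toy namespace.

    tree maps a directory id to {name: entry}; an entry is ("dir", id),
    ("file", id) or ("link", "target/path"). Returns (entry, lookups),
    where lookups counts the per-component lookup calls issued.
    """
    remaining = [c for c in path.split("/") if c]
    current = ("dir", "root")
    lookups = 0
    links = 0
    while remaining:
        name = remaining.pop(0)
        entry = tree[current[1]][name]  # stands in for VOP_LOOKUP
        lookups += 1
        if entry[0] == "link":
            links += 1
            if links > max_links:
                raise RuntimeError("too many symlinks")
            target = entry[1]
            if target.startswith("/"):
                current = ("dir", "root")  # absolute target restarts at root
            # Prepend the symlink's contents to the untranslated remainder.
            remaining = [c for c in target.split("/") if c] + remaining
        else:
            current = entry
    return current, lookups
```

Every component costs one lookup, and each symlink adds the components of its target, which is why an uncached path of n components can send n or more operations to the server.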


An additional reason why the VFS lookup algo- 
rithm proceeds component-by-component is that 
the NFS protocol has been designed not to contain 
pathname syntax in the protocol because of the de- 
sirability of keeping operating system dependent 
detail out of the protocol specification [7]. Since 
NFS and UFS are the major file systems below the 
VFS layer, VFS algorithms have been designed to 
cater to their constraints. 


2.1 Overview 


The design of the VFS lookup algorithm is sensi- 
ble, since every component must be checked for 
the special cases. However, the component-by- 
component analysis of the path is the cause of the 
large number of lookups that go to the server. If 
there were no special cases, then whole paths could 
be looked up in a single server operation. 


⁵ The High Sierra file system, for CD-ROM.


In fact, the special cases seldom arise. Measuring
in the same environment mentioned earlier
(name lookups generated by a multi-user workload
applied to eight servers over several days), we
found that 97.7% of paths resolved by the VFS
lookup algorithm crossed no mount points and
98.9% contained no symlinks.


Our work capitalizes on these facts. We de- 
velop a “path lookup” operation that can trans- 
late several components of a path. This oper- 
ation assumes that the path includes no special 
cases. After a path-lookup, we apply some checks 
for the special cases. If any is found, then fur- 
ther path-lookups may be necessary, and it is 
possible that some of the work performed by the 
first path-lookup may have been wasted. Hence, 
path-lookup is an “optimistic” operation. The 
number of path-lookup operations and the extent 
to which some of them may perform wasted work 
varies for each path. However, for the overwhelm- 
ing majority of paths, a single path-lookup suf- 
fices to translate the path into a vnode. At worst, 
the path-lookup call will translate only the first 
component; so the ordinary lookup operation is 
the degenerate case of path-lookup. 


Our approach is, first, to add a path-to-vnode 
cache at the VFS level and, second, to augment 
NFS as necessary to look up whole paths whenever
the path cache misses. Specifically, an additional
path-lookup call is added to the NFS
protocol; this call accepts a pathname which the 
server translates until the first symlink (if any) is 
encountered. The response contains three fields: 


1. The longest symlink-free prefix of the path. 
The prefix may be null. 


2. The file handle for the prefix. 


3. The untranslated suffix of the path, with the
first symlink expanded and prepended. The
suffix will be null if the path contains no symlink.


Note that our additional VFS-level path cache is 
separate from and logically above DNLC. DNLC is 
used within individual file systems; the path cache 
is used within the VFS lookup code only. Also, 
the results of path-lookup cannot be used to fill 
entries in DNLC, since DNLC maps component 
to vnode, whereas the path cache maps path to 
vnode. 


The path-lookup call should be directed only 
to servers that are capable of handling it. The 
proper approach would be to alter the MOUNT 
protocol so that, at mount time, the file server in- 
dicates if it can handle path-lookup, and, if so, 
which types of pathname syntax it understands. 
The client would then store this information in
the struct vfs for that mount. However, to reduce
the number of required protocol changes, our
code assumes that every mounted file system un- 
derstands the call, and tries it. If a “bad opera- 
tion” RPC error occurs, or if the RPC succeeds 
but the server indicates that it cannot handle the 
syntax of the pathname, the server’s inability is 
recorded in the struct vfs. Besides avoiding a 
change to the MOUNT protocol, this approach 
has the advantage of slightly easing incremental 
deployment. A disadvantage is that a server’s lim- 
its are repeatedly re-discovered (once per mount), 
and automounters — which are increasingly com- 
mon — tend to enormously increase the number 
of times that a file system is (un)mounted. 
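The try-first strategy can be sketched as follows. All names here (BadOperation, Mount, OldServer, lookup_via) are hypothetical stand-ins, and only the “bad operation” RPC error is modelled; the case where the RPC succeeds but the server rejects the pathname syntax would be recorded the same way.

```python
class BadOperation(Exception):
    """Stands in for a 'bad operation' RPC error."""


class Mount:
    """Hypothetical stand-in for the per-mount struct vfs."""

    def __init__(self, server):
        self.server = server
        self.path_lookup_ok = True  # optimistic default


class OldServer:
    """A server that predates path-lookup (for illustration only)."""

    def __init__(self):
        self.path_calls = 0

    def path_lookup(self, path):
        self.path_calls += 1
        raise BadOperation()

    def component_lookup(self, path):
        return ("by-component", path)


def lookup_via(mount, path):
    """Try path-lookup first; fall back to ordinary lookups if unsupported."""
    if mount.path_lookup_ok:
        try:
            return mount.server.path_lookup(path)
        except BadOperation:
            mount.path_lookup_ok = False  # remembered for this mount only
    return mount.server.component_lookup(path)
```

Because the flag lives in the per-mount record, each new mount of the same server pays for one failed call again, which is the disadvantage noted above for automounted file systems.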


2.2 Details 


This section explains how path lookup adjusts to 
the three special cases: symlinks, mount points, 
and dot-dot. 


2.2.1 Symlinks 


Any component of a path may be a symlink, and
symlinks may expand to anything. Therefore, the 
servers and directories visited while translating a 
path are not predictable simply by inspecting the 
initial path. 


For an example, consider the path “./x/y/z” 
illustrated in Figure 1. If x were a mount point, 
then y/z should be resolved in a different file sys- 
tem than it would be if x were not a mount point. 
The client could detect if x were a mount point, 
since the client knows its mount points. However, 
x could also be a symlink that would expand to w, 
which in turn may or may not be a mount point, 
leading to the same predicament. 


The catch-22 is that the client cannot know 
which server to contact until it knows whether the 
path “is what it seems to be” and the client can- 
not know that a path is what it seems to be until 
its components have been looked up at the server. 


To break the cycle, we optimistically assume 
that the path contains no special cases. Referring 
to the example above, the client would look up the
path x/y/z on the server for directory “.” (which 
is necessarily the right server for the lookup of x). 
The server responds with a partition of x/y/z into 
a symlink-free prefix and a suffix that begins with 
the expansion of the first symlink, if any exists. 


The reason that the path-lookup operation
translates only to the first symlink and not to the
end of the path is that the optimistic assumption
may be false. If the path does contain a special
case, the server is probably wasting some effort
translating a path that is different from the one
that should be translated. In such cases, the client
must have enough information to survive the false
assumption. We chose to have the server return
from the path-lookup as soon as it encounters
information that might signal that the optimistic
assumption is false. Note that the server cannot
return when the path crosses a mount point because
the mount points that are relevant are those on the
client and, practically speaking, the server cannot
know the paths of the client’s mount points. So
the server is doing all it can by returning when
it encounters a symlink. Having the server return
on every symlink has essentially no effect on performance
(because of the rarity of special cases)
and somewhat simplifies the client (since the client
need retain information for and check for only two
of the three special cases).


Figure 1: Uncertainty Caused By Symlinks. (Panels: x is not a mount point; x is a mount point; x is a symlink. Legend: covered root of file system; symlink; mounted on.)


2.2.2 Mount Points 


After the path-lookup returns, the client
examines the symlink-free prefix for mount points.
If no mount point is found, then the prefix was
translated on the correct server. So the algorithm
repeats by sending the suffix, if any, to the same
server. If the path of some mount point is contained
in the prefix, then the path lookup may
have been directed to the wrong server, so the
portion of the path (prefix and suffix) below the
first mount point is sent to the server for the
mounted file system (assuming that it understands
path-lookup).


The reason for the mount-point check is that a 
server that looks up a path does so with respect to 
its name space; however, the semantics of file name
translation demand that a path be translated with 
respect to the name space of the client. Con- 
sider the example in Figure 2. During the trans- 
lation of /usr/local/gnu/bin/emacs, the name 
gnu/bin/emacs is translated by Server A because 
the client has mounted that server’s file system 
on its name /usr/local. However, the client has 
also mounted Server B’s file system on the name 
/usr/local/gnu. Therefore, the correct transla- 
tion is that of bin/emacs with respect to Server 
B’s file system, rather than gnu/bin/emacs with 
respect to Server A’s file system. So the transla- 
tion provided by Server A may be wrong. 


Figure 2: Why Symlink-free Path Must Be Compared Against Mount Points. (Diagram labels: disk file systems; vnodes of the client; client’s local file system; Server A; Server B. Legend: the vnode of; mounted on.)


In order to have enough information to check for
mount points, the client accumulates the symlink-free
prefixes returned from all path-lookup calls.
After each call returns, the current accumulated
symlink-free prefix is compared against all mount
points. In order to provide fast search through all
mount points, we added a trie index that points 
to all NFS mount points. The trie stores absolute 
path names, as shown in Figure 3. However, the 
pathnames generated by a process are resolved rel- 
ative to either its current root (curroot) or current 
working directory (cwd). This twist presents no 
problem: the vnodes for cwd and curroot are available,
and each contains a pointer to its struct
vfs; by definition, these structures are represented
in the trie by their complete, absolute pathnames. 
Therefore, a pathname lies in a file system differ- 
ent from the one housing the starting point iff: 


1. There is another mount point farther down 
the branch of the trie housing the struct vfs 
of the starting point. 


2. The pathname is not embedded in the trie 
between the struct vfs of the starting point 
and the next struct vfs. 
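The two conditions above amount to walking the trie from the starting point and stopping at the first mount point met along the prefix. A minimal sketch, assuming a nested-dictionary trie keyed by path component (MountTrie and its methods are hypothetical, not the kernel data structure):

```python
class MountTrie:
    """Toy trie of absolute mount-point paths, one level per component."""

    def __init__(self):
        self.root = {}

    def add(self, mount_path, fs):
        node = self.root
        for comp in mount_path.strip("/").split("/"):
            if comp:
                node = node.setdefault(comp, {})
        node[""] = fs  # the empty key marks a mount point at this node

    def first_mount_below(self, start, prefix):
        """Return (fs, rest) for the first mount point crossed when walking
        the relative path prefix from the absolute path start, else None."""
        node = self.root
        for comp in start.strip("/").split("/"):
            if comp:
                node = node.get(comp)
                if node is None:
                    return None
        comps = [c for c in prefix.split("/") if c]
        for i, comp in enumerate(comps):
            node = node.get(comp)
            if node is None:
                return None  # the path leaves the trie: no mount crossed
            if "" in node:
                return node[""], "/".join(comps[i + 1:])
        return None
```

For the example of Figure 2, a prefix gnu/bin/emacs resolved from /usr/local crosses the mount at /usr/local/gnu, so bin/emacs has to be sent to Server B instead.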


Our implementation platform, SunOS 4.1.3,
keeps the paths of all its mount points only in
the file /etc/mtab. For three reasons, we made
changes so that the name of a mount point is also
kept in the associated struct vfs. First, for performance:
the pathnames of mount points have
to be accessed on every path lookup. Second, to
avoid race conditions: after initiating an I/O to
access /etc/mtab the kernel would continue; the
kernel’s next operation might be another that accesses
or manipulates /etc/mtab. Finally, as part
of earlier work [10], we had already written some
code to store mount point names in struct vfs.


2.2.3 Dot-dot 


Dot-dot must be handled with care similar to 
that for mount points. The reason is the same: 
the server will interpret dot-dot with respect to 
its name space, whereas the required semantics are 
with respect to the client’s name space. Usually 
the two interpretations are the same. The only 
exception is if dot-dots in the path result in going 
above the root of the remote file system.⁶ For
example, suppose /usr/local is an exported file 
system; then the path “/usr/local/..” refers to 
/usr on the client, not the server. 


Unfortunately, we thought of no simple and
clean check for and adjustment to the possibility
of backing up over the root of the containing
file system. The rub is that it is messy for
the client to remember the path of every starting
point (i.e., process cwd or curroot). So before
each call to path-lookup, the client checks if the
path goes above the starting point (i.e., cwd or
curroot) of the translation. If so, the path lookup
is aborted and the regular VFS lookup algorithm
is used. This conservative approach means that
any path that begins with dot-dot is not processed
using path-lookup. Similarly, after the
final path-lookup operation responds, the client
checks the symlink-free prefix; if the prefix would
back up over the starting point, then the path
lookup algorithm is aborted.


⁶ The NFS server checks for and prevents this case on
every lookup.
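One way to express the conservative dot-dot check is a purely textual depth count, sketched below. escapes_start is a hypothetical helper, not a function from the implementation; because it looks only at the text of the path, it is meaningful for a symlink-free path or prefix.

```python
def escapes_start(path):
    """True if a path would climb above the directory it starts from."""
    depth = 0
    for comp in path.split("/"):
        if comp in ("", "."):
            continue  # empty and "." components do not change depth
        if comp == "..":
            depth -= 1
            if depth < 0:
                return True  # backed up over the starting point
        else:
            depth += 1
    return False
```

A path for which this returns True would be handed to the regular VFS lookup algorithm rather than to path-lookup.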


2.2.4 Path Cache 


The path cache is referenced from within the 
new VFS lookup algorithm and from within the 
NFS code for validating caches. Most of the code 
for the path cache was copied and adapted from 
DNLC. 


Since NFS provides no means for a server to
call back to a client, the path cache can contain
outdated information.⁷ Stale cache entries are removed
by NFS’s normal timeout-driven checking
of vnode attributes and, since the cache is managed
LRU, by aging the oldest entry during an
insert operation. All these traits are shared with
DNLC.


⁷ The same is true for DNLC and, indeed, for a client-side
cache of any type of information about remote NFS
files.


One difference between the path cache and 
DNLC is that explicit deletion (such as when a 
file is deleted or renamed) is handled slightly dif- 
ferently. The delete and rename operations delete 
from the path cache by vnode since a vnode is a 
unique ID and since it would be difficult to construct
a path for the target. DNLC entries can be
deleted by either vnode or component name. 
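The cache behavior described in this section can be sketched as a small LRU map from path to vnode. This is only an illustration under stated assumptions: the class and method names are hypothetical, vnodes are modeled as plain integers, and the linear scan in purge_vnode stands in for whatever indexing the kernel code uses.

```python
from collections import OrderedDict


class PathCache:
    """Toy LRU path cache: oldest entry first, newest entry last."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()        # path -> vnode id

    def lookup(self, path):
        vnode = self.entries.get(path)
        if vnode is not None:
            self.entries.move_to_end(path)  # mark as most recently used
        return vnode

    def insert(self, path, vnode):
        if path in self.entries:
            self.entries.move_to_end(path)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # age out the oldest entry
        self.entries[path] = vnode

    def purge_vnode(self, vnode):
        """Explicit deletion by vnode, as done on delete and rename."""
        stale = [p for p, v in self.entries.items() if v == vnode]
        for path in stale:
            del self.entries[path]
        return len(stale)
```

Deleting by vnode removes every cached path that reaches the same file, so the caller never has to reconstruct a path for the target.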


2.2.5 Protocol Change 


The definition of the new call added to version 
2 of the NFS protocol is: 


struct pathlookupargs {
    nfspath pathname;
    int syntax;
};

struct pathlookupokres {
    nfs_fh file;
    fattr attributes;
    nfspath prefix;
    nfspath suffix;
};

union pathlookupres
switch (nfsstat status) {
case NFS_OK:
    pathlookupokres pathlookupres;
default:
    void;
};

pathlookupres
NFSPROC_PATH_LOOKUP(pathlookupargs) = 18;


A new “unintelligible syntax” error code was 
necessary. However, since, at the server, path 
lookup is simply an iterative application of the 
regular lookup operation, all the same access con- 
straints apply. 


2.2.6 Server Change 


To implement the path-lookup operation on 
the server side, we stole code from other NFS 
operations, especially lookup. Essentially, the 
path-lookup implementation is that of lookup 
with two main additions: 


1. Instead of translating a single component, the 
code iterates over components until it reaches 
the end, a symlink, or an error. 


2. When a symlink is encountered, it is read 
(by calling the server-side operation to read 
a symlink) and prepended to the remaining 
untranslated path. 
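The two additions above can be sketched with an in-memory tree standing in for the server's file system. Everything here is an assumption made for illustration: directories are nested dicts, the Symlink class and path_lookup function are hypothetical names, and the returned prefix and suffix are our reading of the fields in the protocol definition (the symlink-free part already translated, and the symlink contents prepended to the untranslated remainder).

```python
class Symlink:
    """Stand-in for a symbolic link; `target` is the link's contents."""

    def __init__(self, target):
        self.target = target


def path_lookup(root, path):
    """Translate `path` below `root`, stopping at the end or at a symlink.

    Returns (node, prefix, suffix). Errors surface as exceptions, much as
    a failing component lookup would end the iteration on a real server.
    """
    node = root
    components = [c for c in path.split("/") if c]
    done = []
    for i, name in enumerate(components):
        if not isinstance(node, dict):
            raise NotADirectoryError("/".join(done))
        if name not in node:
            raise FileNotFoundError(name)
        child = node[name]
        if isinstance(child, Symlink):
            # Read the link and prepend it to the untranslated remainder.
            rest = components[i + 1:]
            return node, "/".join(done), "/".join([child.target] + rest)
        node = child
        done.append(name)
    return node, "/".join(done), ""
```

With an empty suffix the translation is complete; otherwise the client resumes translation from the returned prefix using the suffix.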


3 Evaluation 


This algorithm is implemented in SunOS 4.1.3 and 
is part of the operating system regularly booted on 
nine SparcStations. Fewer than a thousand lines 
of code were added. On the server side, only a 
small addition was made to the module that im- 
plements NFS operations (nfs_server.c). Most 
changes were on the client side, where vfs_lookup.c
received major changes and a few other modules
needed to store the path name of mount
points in the struct vfs.


Before implementing, we studied the distribu- 
tion of lengths of pathnames. The longer the 
pathname given to the path lookup algorithm, 
the greater the upside potential. Measuring in 
the same environment as noted earlier — eight 
major multi-user departmental file servers — we 
found considerable variation in path length distri- 
bution from machine to machine and time to time. 
However, on average, path lengths of 4 were most
common, with lengths of 3 the next most com- 
mon. Paths of length 3 or 4 together accounted 
for over 70% of lookups. Of the remainder, most 
were shorter. 


To measure the effect of our changes, we used 
a kernel build as a benchmark; all major kernel 
sources, libraries, and include files were on a re- 
mote file system. Within a short period of time, a 
kernel build opens a large number of files in a rel- 
atively small number of directories. Kernel builds 
provide a friendly test for path lookup: the av- 
erage path length is somewhat longer (4.6) than 
the more comprehensive number noted above. Be- 
cause of background activity, the results varied 
a little, but on average the build ran 8% faster 
with path lookup in effect. Eight percent is a sub- 
stantial speedup considering that it is the effect of 
changing only the lookup operation. 


Measuring the effect path lookup has on the 
server is at once easier and harder than measuring 
client latency. Measuring number of server opera- 
tions is easy, but each path lookup operation can 
be expected to perform more work than an ordi- 
nary NFS lookup. We used nfsstat to measure 
the number of operations serviced and vmstat to
record processor idle time and I/O operations. For 
a set of kernel build benchmarks, the number of 
NFS operations declined 20% and processor idle 
time averaged 16% higher with path lookup in ef- 
fect. Apparently, processor overhead for handling 
NFS requests is substantial. 


3.1 Violation of NFS Design Principle 


As noted earlier, the addition of path-lookup violates
the longstanding design decision to keep the
NFS protocol free of path syntax.


One possible rejoinder is that, while it is true 
that it is desirable to keep operating system 
specifics out of the NFS protocol, this design de- 
cision was made several years ago; since then, 
NFS, though widely ported, has received almost 
all its use on DOS and UNIX platforms. We ques- 
tion whether the substantial negative impact on 
performance caused by component-by-component 
lookup is acceptable considering that the abstrac- 
tion offered by omitting pathname syntax “ab- 
stracts” over effectively only two implementations. 


Another, possibly better, rejoinder would be 
to re-design our protocol change so that it ac- 
cepted and returned not paths, but rather vectors 
of opaque components. It would still be neces- 
sary to exchange an indication of how to interpret 
the components, but at least the letter, if not the 
spirit, of the original design principle would be 
preserved. We have not yet made this change to 
the protocol. 


4 Related Work 


As noted in the introduction, lookup, getattr, 
and null comprise the vast majority (over 80%) of 
NFS operations handled by our main servers. The 
high number of null operations is attributable to 
our use of the Amd automounter [5]; Amd period- 
ically “pings” every mounted file system, and NFS 
null is the ping operation. While the high number 
of null operations can thus be dismissed as site- 
specific, the dominance of getattr and lookup is 
typical for most NFS installations — as the opera- 
tion mix in nhfsstone indicates. So naturally there 
has been interest in sharply reducing the frequency 
of these operations. 


Most such interest has focused on getattr, 
although in most experiments the motivation 
has not been simply to reduce the frequency of 
getattr but rather to improve the consistency 
guarantee provided to the client by an NFS server. 
One early experiment is “Spritely NFS” [6], in 
which a callback scheme similar to that in Sprite 
[4] was added to NFS with the intention of pro- 
viding strict cache consistency and improving per- 
formance by eliminating the overhead of refresh- 
ing the attribute cache with getattr. The ideas 
in Spritely NFS were used in modified form in 
“NQNFS” [3], which is a second protocol available
from the NFS implementation of 4.4BSD. NQNFS 
differs from Spritely NFS in that the latter re- 
quires the server to keep state indicating the cache 
status of files at clients; however, NQNFS borrows 
the “lease” idea from Gray and Cheriton [2] in 
order to avoid the need for servers to keep state 
across failures. In NQNFS, a client is allowed to 
cache a file for a specified period of time. The
only recovery action a server must take is to wait 
until such time as all of its leases must have ex- 
pired. Neither of these systems addresses the issue 
of reducing lookups. 


More recently, the specification for version 3 of 
NFS [8] included, among its many changes, the re- 
quirement to return attributes as a side effect of 
every appropriate NFS operation. The intention 
is to reduce the number of separate getattr op- 
erations that must be invoked in order to verify 
attribute cache consistency. 


Cache consistency and NFS version 3 are a bit 
far afield, but we are not aware of any attempts 
— besides DNLC — to reduce the cost and/or 
number of NFS lookup operations. However, the 
notion of looking up and caching whole or partial 
paths has been proposed before for new file system 
designs [9, 1]. 


In 1986 Welch and Ousterhout described “pre- 
fix tables” [9]. Prefix tables are useful in an en- 
vironment where a shared global hierarchy of files 
is partitioned into “domains,” which are spread 
across servers. Each client maintains a prefix ta- 
ble that maps file name prefixes to the servers on 
which the associated domains reside. Prefix ta- 
ble entries are hints: if a file is not where a table 
says it is, shorter prefixes are tried until the file is 
found. If a client has no prefixes at all for a file 
(as will be the case initially), it broadcasts the file 
name to all servers. Relevant prefix/server map- 
pings are returned by all servers that have such 
mappings. In this way, prefix table information 
can be easily propagated without the requirement 
that any two clients have precisely the same table. 
And because prefix tables contain only hints that 
need not be correct, this method avoids creating 
either an availability or a consistency problem. 
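The client-side use of a prefix table can be sketched as a longest-prefix search that yields progressively shorter candidates. This is only our illustration of the idea as summarized above, not code from the prefix table work; the function name and the dict-based table are hypothetical.

```python
def candidate_servers(prefix_table, path):
    """Yield (prefix, server) pairs for `path`, longest prefix first.

    Entries are hints: a client would try each candidate in turn and
    broadcast the name only if none of them works out.
    """
    components = [c for c in path.split("/") if c]
    for n in range(len(components), -1, -1):
        prefix = "/" + "/".join(components[:n])
        if prefix in prefix_table:
            yield prefix, prefix_table[prefix]
```

An empty result corresponds to the initial state in which the client has no prefixes at all for the file.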


The prefix table idea is quite similar to our 
work; however, there is one major difference be- 
tween the model of file system use in NFS and that 
in the prefix table proposal. Welch and Ouster- 
hout describe a construct called a “remote link,” 
which is apparently a replacement for the idea 
of client mounts. Distinct domains are stitched 
together with remote links, which are server-side 
mounts; that is, the client has no control over how 
to overlay domains on top of one another — the 
information is encoded in remote links in the file 
system, and all clients see the same arrangement 
of domains into a hierarchy. Implementing static 
mounts on the server side is a significant simpli- 
fication for whole-path translation (and a signifi- 
cant loss of flexibility for the client). In our design, 
the server returns after encountering a symlink 
and the client must check the symlink-free path for 
mount points; both of these features exist only
because the server cannot know the client’s mount
points. In Welch and Ousterhout’s system, un- 
like in NFS, there is no complication with having 
the server expand a symlink and continue trans- 
lating it without contacting the client. In their 
work, the only time a server returns a pathname to
the client not completely translated is when some 
component of the path crosses a domain bound- 
ary: either dot-dot in the upward direction or a 
remote link in the downward direction. In these 
cases, the client must check its prefix table to learn 
which server to send the remainder of the path to. 
In summary, the prefix table is an elegant idea but
one targeted for a significantly different and easier 
model of file system definition and use. 


In [1], Cheriton and Mann describe a naming 
system that is scalable enough to encompass the 
world and general enough to name many types of 
objects (not just files but also processes, windows, network
connections, etc.). Since the features that
draw most of their design attention are those that 
permit scaling to enormous size, it is hard to make 
a meaningful comparison between our work and 
theirs. Their system has the notion of looking up 
and caching whole or partial pathnames. How- 
ever, they reject the notion of client mounts on the 
grounds that such client-specific name space man- 
agement operations do not scale well and stand in 
the way of forming a consistent global name space. 
The notion of symbolic links seems absent from 
their design, presumably on similar grounds. 


5 Summary 


Measurements of NFS pathnames and lookup performance
yield several pronounced facts:


• The hit rate of DNLC is not easily improved;
nevertheless, enough lookup operations cannot
be satisfied from DNLC that lookup is
the operation that most commonly goes over
the wire to the server.


• Average path length is long enough that
translating a completely uncached path will
often require as many as 3 or 4 lookup operations.


• Paths given to the VFS lookup algorithm almost
never contain symlinks or cross mount
points, and so can typically be translated at
the server in a single operation.


Given these facts, one may question whether the 
elegance and relative simplicity of the VFS lookup 
algorithm — which translates uncached path- 
names one component at a time — is sufficient 
compensation for its high overhead. 


Indeed, the server load caused by repeated NFS 
lookup operations can be reduced by up to 16% if 
path lookup is used instead of component lookup. 
Also, path lookup can have a noticeable effect 
on client latency for workloads that are open- 
intensive. 


Path lookup is not hard to implement, the only 
tricky aspect being that a translated path must be 
examined even after a “successful” translation in 
order to ensure that the middle of the translated 
path did not cross a mount point. If it did, then 
the pathname looked up at the server is not the 
path that should have been looked up — the por- 
tion of the pathname below the mount point has 
different meaning on the two different servers. 
Abstract


We consider how to improve the performance of file 
caching by allowing user-level control over file cache 
replacement decisions. We use two-level cache man- 
agement: the kernel allocates physical pages to in- 
dividual applications (allocation), and each applica- 
tion is responsible for deciding how to use its physi- 
cal pages (replacement). Previous work on two-level 
memory management has focused on replacement, 
largely ignoring allocation. 

The main contribution of this paper is our solution
to the allocation problem. Our solution allows
processes to manage their own cache blocks,
while at the same time maintaining the dynamic
allocation of cache blocks among processes. Our
solution makes sure that good user-level policies can
improve the file cache hit ratios of the entire system
over the existing replacement approach. We evaluate
our scheme by trace-based simulation, demonstrating
that it leads to significant improvements in
hit ratios for a variety of applications.


1 Introduction 


File caching is a widely used technique in today’s
file system implementations. Since CPU speed and
memory density have improved dramatically in the
last decade while disk access latency has improved
slowly, file caching has become increasingly important.
One major challenge in file caching is to provide
a high cache hit ratio.

This paper studies an application-controlled file 
caching approach that allows each user process to 
use an application-tailored cache replacement pol- 
icy instead of always using a global Least-Recently- 
Used (LRU) policy. Some applications have special 
knowledge about their file access patterns which can 
be used to make intelligent cache replacement deci- 
sions. For example, if an application knows which 
blocks it needs and which it does not, it can keep
the former in cache and reduce its cache miss ratio. 

Traditionally such applications buffer file data in 
user address space as a way of controlling replace- 
ment. However, since the kernel tries to cache file 
data as well, this approach leads to double buffer- 
ing, which wastes space. Furthermore, this ap- 
proach does not give applications real control be- 
cause the virtual memory system can still page out 
data in the user address space. Hence we need an- 
other way to let applications control replacement. 

To reduce the miss ratio, a user-level file cache 
needs not only an application-tailored replacement 
policy but also enough available cache blocks. In 
a multiprocess environment, the allocation of cache 
blocks to processes will thus affect the file cache hit 
ratio of the entire system. It is the kernel’s job to 
ensure that the hit ratio of the whole system does 
not degrade because of the user-level management 
of cache replacement policies. The challenge is to 
allow each user process to control its own caching 
and at the same time to maintain the dynamic al- 
location of cache blocks among processes in a fair 
way so that overall system performance improves. 

This paper describes a scheme that achieves this 
goal. Our approach, called “two-level block replace- 
ment”, splits the responsibilities of allocation and 
replacement between kernel and user level. A key 
element in this scheme is a sound allocation policy 
for the kernel, which is discussed in section 3. This 
allocation policy guarantees that an application- 
tailored replacement policy can improve the overall 
file system performance and that a foolish replace- 
ment policy in one application will not degrade the 
file cache hit ratios of other processes. 

We have evaluated our allocation policy using
trace-driven simulation. In our simulations, we
used several file access traces that we collected on
a DEC 5000/200 workstation running the Ultrix
operating system, and the Sprite traces from the
University of California at Berkeley. We have simulated
our allocation policy and various replacement
policies for individual application processes. The
simulations show that an application-tailored replacement
policy can reduce an application’s file
cache miss ratio by up to 100% relative to the global
LRU policy. In addition, in a multiprocess environment,
the combination of our allocation policy and
application-tailored replacement policies can reduce
the overall file cache miss ratio by up to 50% relative
to the traditional global file caching approach.


2 User Level File Caching 


Our goal is to allow user-level control over cache 
replacement policy. In many cases, the application 
has better knowledge about its future file accesses 
than the kernel has. User level control of cache 
replacement enables the application to use its better 
knowledge to improve the hit ratio. 

Despite the advantages of application control, 
we cannot simply move all responsibility for cache 
management to the user level. In a multipro- 
grammed system, the kernel serves a valuable func- 
tion: managing the allocation of resources between 
users to guarantee the performance of the entire 
system. 


2.1 Two-Level Replacement 


We propose a scheme for file caching that splits re- 
sponsibility between the kernel and user levels. The 
kernel is responsible for allocating cache blocks to 
processes. Each user process is free to control the 
replacement strategy on its share of cache blocks; if 
it chooses not to exercise this choice, the kernel ap- 
plies a default policy (LRU). We call our approach 
two-level cache block replacement. 

To be more precise, each file is assigned to a 
“manager” process, which is responsible for making 
replacement decisions concerning the file. Usually 
the process that currently has the file open is its 
manager; however, this process may designate an- 
other process to be the manager of a particular file. 
If several processes have the same file open simulta- 
neously, then it is up to these processes to agree on 
a manager; if they cannot agree then the kernel im- 
poses the default LRU policy for that file. Processes 
that do not want to control their own replacement 
policy can abdicate their management responsibil- 
ity; in this case the kernel applies the default LRU 
policy for the affected files. 

Figure 1: Interaction between kernel and user processes
in two-level replacement: (1) P misses; (2)
kernel consults Q for replacement; (3) Q decides to
give up page B; (4) kernel reallocates B to P.


The interactions between kernel and manager
processes are the following: On a cache miss, the
kernel finds a candidate block to replace, based on
its global replacement policy (step 1 in Figure 1).
The kernel then identifies the manager process of
the candidate. This manager process is given a
chance to decide on the replacement (step 2). The
candidate block is given as a hint, but the manager
process may overrule the kernel’s choice by suggesting
an alternative block under that manager’s control
(step 3). Finally, the block suggested by the
manager process is replaced by the kernel (step 4).
(If the manager process is not cooperative, then the
kernel simply replaces the candidate block.)
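The consultation step can be sketched as a single function. This is a hedged illustration, not the interface of any implementation: the name choose_victim, the callback signature, and the treatment of an exception or an invalid suggestion as "not cooperative" are all assumptions.

```python
def choose_victim(candidate, owned_blocks, manager=None):
    """Return the block the kernel should replace.

    candidate    -- block picked by the kernel's global policy (the hint)
    owned_blocks -- cached blocks under the control of the candidate's manager
    manager      -- optional callable (candidate, owned_blocks) -> block
    """
    if manager is None:
        return candidate            # no manager: kernel default applies
    try:
        choice = manager(candidate, owned_blocks)
    except Exception:
        return candidate            # uncooperative manager
    # Only a block under that manager's control may be substituted.
    return choice if choice in owned_blocks else candidate
```

Whatever block is returned, the page changes hands from the manager's process to the process that missed, which is why the kernel's choice of candidate acts as an allocation decision.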

The kernel’s replacement policy is in fact an al- 
location policy. Suppose that process P’s reference 
misses in the cache and the kernel finds a replace- 
ment candidate owned by process Q. Although pro- 
cess Q’s user-level replacement policy decides which 
of its blocks will be replaced, the replacement will 
cause a deallocation of a block from process Q and 
an allocation of a block to process P. 


2.2 Kernel Allocation Policy 


The kernel allocation policy is the most critical part 
of two-level replacement. To obtain best perfor- 
mance, it is known that allocation should follow 
the dynamic partition principle [11]: each process 
should be allocated a number of cache blocks that 
varies dynamically in accordance with its working 
set size. Experience has shown that global LRU or 
its approximations perform relatively well; they ap- 
proximate the dynamic partition principle or tend 
to follow processes’ working set sizes. 

Our goal is to design an allocation policy for the
kernel to guarantee that the two-level replacement
method indeed improves system performance over
the traditional global (or single level) replacement
method when user processes are not making bad replacement
decisions. To be more precise, let us call
a decision to overrule the kernel wise if the candidate
suggested by the kernel is referenced before the
alternative block, and call the decision foolish if
the alternative block is referenced first. (A decision
not to overrule the kernel can be viewed as neutral.)
The kernel’s allocation policy should satisfy three
principles:


1. A process that never overrules the kernel does 
not suffer more misses than it would under 
global LRU. This ensures that ordinary pro- 
cesses, which are unwilling or unable to predict 
their own accesses, will perform at least as well 
as they did before. 


2. A foolish decision by one process never causes 
another process to suffer more misses. Of 
course, we cannot prevent a process from dis- 
carding its valuable pages. However, we must 
ensure that an errant or malicious process can- 
not hurt the performance of others. 


3. A wise decision by one process never causes 
any process, including itself, to suffer more 
misses. This ensures that processes have an 
incentive to choose wisely. (It goes without 
saying that wise decisions should actually im- 
prove performance whenever possible.) 


The main contribution of this paper is to propose 
and evaluate an allocation policy that satisfies these 
design principles. 


3 An Allocation Policy 


We will describe our allocation policy in an evolu- 
tionary fashion. We will start with a simple but 
flawed policy, and then diagnose and fix two prob- 
lems with it. The result will be a fully satisfactory 
allocation policy. 


3.1 First Try 


To start, we can have an allocation policy that is 
literally the same as that of global LRU. The kernel 
simply maintains an LRU list of all blocks currently 
in the file cache. When a replacement is necessary, 
the block at the end of the LRU list is suggested as 
a candidate, and its owner process is asked to give 
up a block. 

The problem is that if the owner process overrules 
the kernel, the candidate block still stays at the end 
of the LRU list. On the next miss, the same process 
will again be asked to give up a block. 

The left side of Figure 2 shows an example. Two
processes, P and Q, share a file cache with four
blocks. Process P uses blocks A and B; process Q
uses blocks W, X, Y, and Z. The reference stream
is Y, Z, A, B.

The top line shows the initial LRU list, at time
t0. The first reference, to Y at time t1, causes a
replacement. The kernel consults its LRU list and
suggests A for replacement. A’s owner, process P,
overrules the kernel. It decides to replace B, hoping
to save the next miss to A. The LRU list is now as
shown at time t2. The next reference, to Z at time
t3, causes another replacement. Again, A is chosen
as a candidate. This time P has no other blocks in
the cache and hence must give up A. The LRU list
is now as shown at time t4. At this point, the next
two references, to A and B, both miss.

There are four misses in this example, two misses 
by P and two by Q. But note that under global 
LRU, there would be only three misses, one by P 
and two by Q. This violates Principle 3: a wise 
decision by process P causes P to suffer one extra 
miss. 
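The miss counts in this example can be checked with a short simulation. The figure itself is not reproduced here, so the initial LRU order used in the usage note (A at the LRU end, then W, X, and B at the head) is an assumption inferred from the miss counts stated in the text; the function name first_try and the overrule table are likewise hypothetical.

```python
from collections import Counter


def first_try(initial_lru, owner, refs, overrule=None):
    """Count misses per process under the first-try policy.

    initial_lru -- cached blocks, index 0 is the LRU end
    owner       -- block -> process
    overrule    -- candidate -> alternative the owner prefers to give up;
                   an empty table gives plain global LRU
    """
    lru = list(initial_lru)
    overrule = overrule or {}
    misses = Counter()
    for b in refs:
        if b in lru:                        # hit: move to the head
            lru.remove(b)
            lru.append(b)
            continue
        misses[owner[b]] += 1
        candidate = lru[0]
        victim = overrule.get(candidate, candidate)
        if victim not in lru or owner[victim] != owner[candidate]:
            victim = candidate              # no valid alternative left
        lru.remove(victim)                  # the candidate keeps its place
        lru.append(b)
    return misses
```

With initial_lru = ["A", "W", "X", "B"], references "YZAB", and overrule = {"A": "B"}, this gives two misses for P and two for Q; with no overrule table it gives one for P and two for Q, matching the counts above.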


3.2 Swapping Position 


The problem in the above scheme arises because 
the LRU list is maintained in strict reference order. 
Intuitively, the only use of the LRU list is to decide 
which process will give up a block upon a miss. 
To get the same allocation policy as the existing 
global LRU policy, our policy’s LRU list should be 
in correspondence to the LRU list in the original 
algorithm. This can be achieved by swapping the 
blocks’ positions in the LRU list. 

Suppose the kernel suggests A for replacement, 
but the user-level manager overrules it with B. At 
this point, the previous policy would simply replace 
B. The new policy first swaps the positions of A and 
B in the LRU list, then proceeds to replace B. As a 
result of this swap, A is no longer at the tail of the 
LRU list. Compared with the LRU list under global 
LRU (i.e. if A is replaced), the only difference is 
that A is in B’s position. 

This fixes the problem with the above example in
Figure 2. The right side of the figure shows what
happens under the new policy. On the first replace- 
ment, A moves to the head of the LRU list before 
B is replaced. The result is that A is still in the 
cache when it is referenced. Process P is no longer 
hurt by its wise choice. 

In general, swapping positions guarantees that if 
no process makes foolish choices, the global hit ratio 
is the same as or better than it would be under 
global LRU. 

Figure 2: This example shows what’s wrong with the first try and how to fix it with the swapping position
mechanism.


Unfortunately, this scheme does not guard
against foolish choices made by user processes. This
is illustrated by the left side of Figure 3. Processes
P and Q share a three-block cache. The top line
shows the initial LRU list at time t0; the reference
stream is Z, Y, A. The first reference, to Z at time
t1, causes a replacement. The kernel suggests X
for replacement. Now suppose process Q makes exactly
the wrong choice: it decides to replace Y. After
swapping X and Y in the LRU list, the kernel
replaces Y, leading to the LRU list as shown at time
t2. The second reference, to Y at time t3, misses.
The kernel suggests A for replacement, and process
P cannot overrule because it has no alternative to
suggest. Thus A is replaced, leading to the LRU
list as shown at time t4. The third reference, to A
at time t5, misses.

There are three misses in this example, one miss 
by P and two by Q. Under global LRU, there would 
be only one miss, by Q. Had Q not foolishly over- 
ruled the kernel, the last two references would both 
have hit in cache. Principle 2 is violated — process 
Q’s foolish decision causes process P to suffer an 
extra miss. 


3.3 Place-Holders


The problem in the previous example arises because 
Q’s choice enables it to acquire more cache blocks 
than it would have had under global LRU. As a re- 
sult, some of P’s blocks are pushed out of the cache, 
which increases P’s miss rate. To satisfy Principle 
2, we must prevent foolish processes from acquir- 
ing extra cache blocks. We achieve this by using 
place-holders. 

A place-holder is a record that refers to a page. It
records which block would have occupied that page
under the global LRU policy. Suppose the kernel
suggests A for replacement, and the user process
overrules it and decides to replace B instead. In
addition to swapping the positions of A and B in
the LRU list, the kernel also builds a place-holder
for B to point to A’s page. If B is later referenced
before A, A’s page can be confiscated immediately.
This allows the cache state to recover from the user
process’s mistake.


The right side of Figure 3 illustrates how place-holders
work. The top line shows the initial LRU
list at time t0. The first reference, to Z at time
t1, misses. Block X is chosen as a candidate for
replacement, but process Q (foolishly) overrules the
kernel and chooses Y for replacement. X and Y are
swapped in the LRU list, and Y is replaced. At this
point, a place-holder is created, denoting the fact
that the block occupied by X would have contained
Y under global LRU.


The second reference, to Y at time t3, misses.
The kernel notices that there is a place-holder for Y:
Y “should” have been in the cache, but was not,
due to a foolish replacement decision by Y’s owner.
The kernel responds to this situation by correcting
the foolish decision: it loads Y into the page that
Y’s place-holder pointed to, replacing X. Note that
in this case the normal replacement mechanism is
bypassed.


The LRU list is now as shown at time t4. The 
third reference, to A, hits. 


This example results in two misses, both by pro- 
cess Q. Under global LRU, there would have been 
only one miss, by Q. Q hurts itself by its foolish de- 
cision, but it does not hurt anyone else. Principle 2 
is satisfied. 
Figure 3: This example shows why place-holders are necessary.


3.4 Our Allocation Scheme 


Combining the above two fixes, here is our full allocation
scheme: If a reference to cache block b hits,
then b is moved to the head of the global LRU list,
and the place-holder pointing to b (if there is one)
is deleted. If the reference misses, then there are
two cases. In the first case, there is a place-holder
for b, pointing to t; in this case t is replaced and
its page is given to b. (If t is dirty, it is written to
disk.)

In the second case, there is no place-holder for b.
In this case, the kernel finds the block at the end of
the LRU list. Say that block c, belonging to process
P, is at the end of the LRU list. The kernel consults
P to choose a block to replace. (The kernel suggests
replacing c.) Say that P’s choice is to replace block
x. The kernel then swaps x and c in the LRU list.
If there is a place-holder pointing to x, it is changed
to point to c; otherwise a place-holder is built for
x, pointing to c. Finally, x’s page is given to b. (x
is written to disk if it is dirty.)
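The scheme can be exercised with a toy model. This is a sketch under stated assumptions rather than the implementation: the class name TwoLevelCache and the manager callback signature are hypothetical, a block brought in through a place-holder is assumed to enter at the head of the LRU list, place-holders pointing to an evicted block are assumed to be dropped, disk writes are omitted, and the cache is assumed to start full.

```python
from collections import Counter


class TwoLevelCache:
    """Toy allocation model; lru[0] is the LRU end, lru[-1] the head."""

    def __init__(self, initial_lru, owner, managers=None, placeholders=True):
        self.lru = list(initial_lru)
        self.owner = owner                # block -> managing process
        self.managers = managers or {}    # process -> callable(candidate, own)
        self.placeholders = placeholders  # False: swap-only policy of 3.2
        self.holders = {}                 # replaced block -> cached block
        self.misses = Counter()           # process -> miss count

    def _forget_holders_to(self, block):
        for absent in [a for a, c in self.holders.items() if c == block]:
            del self.holders[absent]

    def reference(self, b):
        if b in self.lru:                         # hit: move b to the head
            self.lru.remove(b)
            self.lru.append(b)
            self._forget_holders_to(b)
            return True
        self.misses[self.owner[b]] += 1
        if b in self.holders:                     # case 1: place-holder for b
            victim = self.holders.pop(b)
        else:                                     # case 2: consult the manager
            c = self.lru[0]
            proc = self.owner[c]
            own = [blk for blk in self.lru if self.owner[blk] == proc]
            manager = self.managers.get(proc)
            x = manager(c, own) if manager else c
            if x not in own:                      # ignore invalid suggestions
                x = c
            if x != c:
                i, j = self.lru.index(c), self.lru.index(x)
                self.lru[i], self.lru[j] = self.lru[j], self.lru[i]
                repointed = False
                for absent, cached in list(self.holders.items()):
                    if cached == x:               # re-point holder from x to c
                        self.holders[absent] = c
                        repointed = True
                if self.placeholders and not repointed:
                    self.holders[x] = c           # new place-holder for x
            victim = x
        self.lru.remove(victim)
        self._forget_holders_to(victim)
        self.lru.append(b)                        # new block enters at the head
        return False
```

Assuming the initial LRU order X, A, Y (LRU end first) for the example of Figure 3, a manager for Q that always gives up Y, and the reference stream Z, Y, A, the model records two misses, both by Q. With placeholders=False the same run records two misses by Q and one by P, as in the swap-only discussion.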

We can prove that this algorithm satisfies all 
three of our design principles. (A detailed formal 
proof appears in [4].) Our scheme has the prop- 
erty that it never asks a process to replace a block 
for another process more often than global LRU. In 
other words, whenever a process is asked to give up 
a block for another process, it would have already 
given up that block under global LRU. 

To see why, first notice that the place-holder 
scheme ensures that every process appears, in the 
view of other processes, never to unwisely overrule 
the kernel’s suggestions. This is because whenever 
a process makes an unwise decision, only the errant 
process is punished and the state is restored as if 
the mistake were never made. 


Hence, from the allocator’s point of view, every 
process is either doing LRU or is doing something
better than LRU. Therefore we need only guarantee 
that those that are doing better than LRU are not 
discriminated against. Swapping position serves 
this purpose. That is, the allocator doesn’t care 
which of a process’s pages is holding which data, 
as long as its pages occupy the same positions on 
the LRU list that they would have occupied under 
global LRU. 

In summary, our framework for incorporating 
user level control into replacement policy is: con- 
sulting user processes at the time of replacement; 
swapping the positions of the block chosen by the 
global policy and the block chosen by the user pro- 
cess in the LRU list; and building “place-holder” 
records to detect and recover from user mistakes. 
This framework is also applicable to various policies 
that approximate LRU, such as FIFO with second 
chance, and two-hand clock[16]. 


4 Design Issues 


This section addresses various aspects of our 
scheme, including possible implementation mecha- 
nisms, treatment of shared files and interaction with 
prefetching. 


4.1 User-Kernel Interaction 


There are several ways to implement two-level re- 
placement, trading off generality and flexibility ver- 
sus performance. 

The simplest implementation is to allow each user 
process to give the kernel hints. For example, a 
user process can tell the kernel which blocks it no 
longer needs, or which blocks are less important 
than others. Or it can tell the kernel its access 
pattern for some file (sequential, random, etc). The 
kernel can then make replacement decisions for the 
user process using these hints. 

Alternatively, a fixed set of replacement policies 
can be implemented in the kernel and the user pro- 
cess can choose from this menu. Examples of such 
replacement policies include: LRU with relative 
weights, MRU (most recently used), LRU-K[21], 
etc. 

For full flexibility, the kernel can make an upcall 
to the manager process every time a replacement 
decision is needed, as in [18]. 

Similarly, each manager process can maintain a 
list of “free” blocks, and the kernel can take blocks 
off the list when it needs them. The manager would 
be awakened both periodically and when its free-list 
falls below an agreed-upon low-water mark. This is 
similar to what is implemented in [25]. 

Combinations of these schemes are possible too. 
For example, the kernel can implement some com- 
mon policies, and rely on upcalls for applications 
that do not want to use the common policies. In 
short, all these implementations are possible for our 
two-level scheme. We are still investigating which 
is best. 


4.2 Shared Files 


As discussed in Section 2.1, concurrently shared 
files are handled in one of two ways. If all of the 
sharing processes agree to designate a single pro- 
cess as manager for the shared file, then the kernel 
allows this. However, if the sharing processes fail 
to agree, management reverts to the kernel and the 
default global LRU policy is used. 


4.3 Prefetching 


Under two-level replacement, prefetches could be 
treated in the same way as in most current file sys- 
tems: as ordinary asynchronous reads. 

Most file systems do some kind of prefetching. 
They either detect sequential access patterns and 
prefetch the next block[17], or do cluster I/O[19]. 

Recent research has explored how to prefetch 
much more aggressively. In this case, a significant 
resource allocation problem arises — how much of 
the available memory should the system allocate 
for prefetching? Allocating too little space dimin- 
ishes the value of prefetching, while allocating too 
much hurts the performance of non-prefetch ac- 
cesses. The prefetching system must decide how 
aggressively to prefetch. 

Our techniques do not address this problem, nor 
do they make it worse. The kernel prefetching 
code would still be responsible for deciding how aggressively 
to prefetch. We would simply treat the 
prefetcher as another process competing for mem- 
ory in the file cache. However, since the prefetcher 
would be trusted to decide how much memory to 
use, our allocation code would provide it with a 
fresh page whenever it wanted one. 

Recent research on prefetching focuses on ob- 
taining information about future file references[22]. 
This information might be as valuable to the re- 
placement code as it is to the prefetcher, as we dis- 
cuss in the next section. Thus, adding prefetching 
may well make the allocator’s job easier rather than 
harder. 

To facilitate the use of a sophisticated prefetcher, 
there can be more interaction between the alloca- 
tor and the prefetcher. For example, the allocator 
could inform the prefetcher about the current de- 
mand for cache blocks; the prefetcher could vol- 
untarily free cache blocks when it realized some 
prefetched blocks were no longer useful, etc. The 
details are beyond the scope of this paper. 


5 Simulation 


We used trace-driven simulation to evaluate two- 
level replacement. In our simulations the user-level 
managers used a general replacement strategy that 
takes advantage of knowledge about applications’ 
file references. Two sets of traces were used to eval- 
uate the scheme. 


5.1 Simulated Application Policies 


Our two-level block replacement enables each user 
process to use its own replacement policy. This 
solves the problem for those sophisticated applica- 
tions that know exactly what replacement policy 
they want. However, for less sophisticated appli- 
cations, is there anything better than local LRU? 
The answer is yes, because it is often easy to obtain 
knowledge about an application’s file accesses, and 
such knowledge can be used in replacement policy. 

Knowledge about file accesses can often be ob- 
tained through general heuristics, or from the com- 
piler or application writer. Here are some examples: 


• Files are mostly accessed sequentially; the suffix 
of a file name can be used to guess the usage 
pattern of a file: “.o” files are mostly accessed 
in certain sequences, “.ps” files are accessed 
sequentially and probably do not need to be 
cached, etc. 


• Compilers may be able to detect whether there 
is any lseek call to a file; if there is none, then 
it is very likely that the file is accessed sequentially. 
Compilers can also generate the list of 
future file accesses in some cases; for example, 
current work on TIP (Transparent Informed 
Prefetching)[22] is directly applicable. 


• The programmer can give hints about the access 
pattern for a file: sequential, with a stride, 
random, etc. 


When these techniques give the exact sequence 
of future references, the manager process can ap- 
ply the offline optimal policy RMIN: replace the 
block whose next reference is farthest in the future. 
Often, however, only incomplete knowledge about 
the future reference stream is known. For exam- 
ple, it might be known that each file is accessed 
sequentially, but there might be no information on 
the relative ordering between accesses to different 
files. RMIN is not directly applicable in these cases. 
However, the principle of RMIN still applies. 

We propose the following replacement policy to 
exploit partial knowledge of the future file access 
sequence: when the kernel suggests a candidate re- 
placement block to the manager process, 


1. find all blocks whose next references are defi- 
nitely (or with high probability) after the next 
reference to the candidate block; 


2. if there is no such block, replace the candidate 
block; 


3. otherwise, choose the block whose next reference is 
farthest from the next reference of the candidate 
block. 
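
The three steps above can be sketched as follows. This is a minimal illustration rather than the simulator's code: `choose_replacement` and `next_ref` are hypothetical names, `next_ref(block)` is assumed to return a predicted time of next reference (`math.inf` for "never again", `None` when nothing is known), and "definitely or with high probability" is simplified to "a prediction exists". Accepting the kernel's suggestion when nothing is known about the candidate is also an assumption.

```python
import math


def choose_replacement(candidate, cached_blocks, next_ref):
    """Pick the block to replace, given the kernel's suggested candidate."""
    t_candidate = next_ref(candidate)
    if t_candidate is None:
        return candidate  # no knowledge: accept the kernel's suggestion
    # Step 1: blocks known to be referenced after the candidate.
    later = [b for b in cached_blocks
             if b != candidate
             and next_ref(b) is not None
             and next_ref(b) > t_candidate]
    # Step 2: no such block, so replace the candidate itself.
    if not later:
        return candidate
    # Step 3: replace the block referenced farthest in the future.
    return max(later, key=next_ref)
```

With predictions `{"A": 5, "B": 9, "C": 2}` and candidate `"A"`, the sketch picks `"B"`; with candidate `"B"` it keeps the kernel's choice, because no block is known to be referenced later.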


Depending on the implementation of two-level re- 
placement, this policy may be implemented in the 
kernel or in a run-time I/O library. Either way, the 
programmer or compiler needs to predict future se- 
quences of file references. 

This strategy can be applied to common file ref- 
erence patterns. For general applications, common 
file access patterns include: 


• sequential: most files are accessed sequentially 
most of the time; 


• file-specific sequences: some files are mostly accessed 
in one of a few sequences. For example, 
object files are associated with two sequences: 
1) sequential; 2) first symbol table, then text 
and data (used in link editing); 


• filter: many applications access files one by one 
in the order of their names in the command 
line, and access each file sequentially from beginning 
to end. General filter utilities such as 
grep are representative of such applications; 


• same-order: a file or a group of files is repeatedly 
accessed in the same order. For example, 
if “*” (for file name expansion) appears more 
than once in a shell script, it usually leads to 
such an access pattern; and 


• access-once: many programs do not reread or 
rewrite file data that they have already accessed. 


Applying our general replacement strategy, we 
can determine replacement policies for applications 
with these specific access patterns. Suppose the 
kernel suggests a block A, of file F, to be replaced. 
For sequential or file-specific, the block of F that 
will be referenced farthest in the future is chosen for 
replacement; for filter, the sequence of future references 
is known exactly, and RMIN can be applied; 
for same-order, the most recently accessed block 
can be replaced; and for access-once, any block of 
which the process has referenced all the data can 
be replaced. 


5.2 Simulation Environment 


We used trace-driven simulations to do a prelimi- 
nary evaluation of our ideas. Our traces are from 
two sources. We collected the first set ourselves, 
tracing various applications running on a DEC 
5000/200 workstation. The other set is from the 
Sprite file system traces from University of Califor- 
nia at Berkeley[2]. 

We built a trace-driven simulator to simulate the 
behavior of the file cache under various replacement 
policies¹. In our simulation we considered only accesses 
to regular files; accesses to directories were 
ignored for simplicity, the justification being that 
file systems often have a separate caching mechanism 
for directory entries. We also assume that the 
file system has a fixed-size file cache², with a block 
size of 8KB. 

We validated our simulator using Ultrix traces. 
Our machine has a 1.6MB file cache. We can measure 
the actual number of read misses using the Ultrix 
“time” command. Our simulation results were 
within 3% of the real result except for link-editing, 
for which the simulator predicted 7% fewer misses. 
This is because the simulator ignores directory operations, 
which are more common in the link-editing 
application. 


¹ Our traces and simulator are available via anonymous 
ftp from ftp.cs.princeton.edu: pub/pc. 

² That is, we do not simulate the dynamic reallocation of 
physical memory between virtual memory and file system 
that happens in some systems. 

To evaluate our scheme, we compared it with two 
policies: existing kernel-level global LRU without 
application control, and the ideal offline optimal 
replacement algorithm, RMIN. The former is used 
by most file systems, while the latter sets an up- 
per bound on how much miss ratio can be reduced 
by improving the replacement policy. Our perfor- 
mance criterion is miss ratio: the ratio of total file 
cache misses to total file accesses. 


5.3 Results for Ultrix Traces 


We instrumented the Ultrix 4.3 kernel to collect our 
first set of traces. When tracing is turned on, file 
I/O system calls from every process are recorded in 
a log, which is later fed to the simulator. 

Traces were gathered for three application pro- 
grams, both when they were running separately and 
when they were run concurrently. The applications 
are: 


• Postgres: Postgres is a relational database system 
developed at the University of California at 
Berkeley[28]. We used version 4.1. We traced 
the system running a benchmark from the University 
of Wisconsin, which is included in the 
release package. The benchmark takes about 
fifteen minutes on our workstation. 


• cscope: cscope is an interactive tool for examining 
C sources. It first builds a database of 
all source files, then uses the database to answer 
the user’s queries, such as locating all the 
places a function is called. We traced cscope 
when it was being used to examine the source 
code for our kernel. The trace recorded four 
queries, taking about two minutes. 


• link-editing: The Ultrix linker is known to be 
I/O-bound. We collected the I/O traces 
when link-editing an operating system kernel 
twice, taking about six minutes. We also collected 
traces when linking some programs with 
the X11 library. 


Single Application Traces First we’d like to 
see how introducing application control can improve 
each application’s caching performance: 


• Postgres: it is often hard to predict future file 
accesses in database systems. To see whether 
user-level heuristics may reduce the miss ratio, we 
tried the policy for the sequential access pattern. 
Indeed, the miss ratio is reduced (Figure 4). 
We think that the designer of the database system 
can certainly give a better user-level policy, 
thus further improving the hit ratio. 


• cscope: cscope actually has a very simple access 
pattern. It reads the database file sequentially 
from beginning to end to answer each query. 
The database file used in our trace is about 
10MB. For caches smaller than 10MB, LRU 
is useless. The reason that the miss ratio is 
only 12.5% is that the size of file accesses is 
1KB, while the file block size is 8KB. However, 
if we apply the right user-level policy (noticing 
that the access pattern is same-order), the miss 
ratio is reduced significantly (Figure 5). 


• link-editing: the linker in our system makes a 
lot of small file accesses. It doesn’t fit the sequential 
access pattern. However, it is read-once 
(the access-once pattern of Section 5.1). 
Even though the linker is run twice in 
our traces, during each run its user-level policy 
can still be that of read-once. The result 
is shown in Figure 6. For linking with the X11 library, 
we tried both the policy for sequential 
and the policy for read-once at user level (Figure 7). 
read-once seems to be the right policy. 
(Note that this trace is small and a 4MB cache 
is actually enough for it.) 


Figure 7: Performance for Linking with X11 
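
The cscope numbers above can be reproduced with a toy simulation. The sketch below is not the simulator used for the figures; `miss_ratio` and its parameters are hypothetical names, and the 4MB cache size is an assumption chosen only to be smaller than the file. It scans a 10MB file (1280 blocks of 8KB) four times with 1KB reads, once evicting the least recently used block (LRU) and once evicting the most recently accessed block, which is the same-order policy of Section 5.1.

```python
from collections import OrderedDict


def miss_ratio(num_blocks, cache_blocks, scans, reads_per_block, evict_mru):
    """Miss ratio of repeated sequential scans over one file."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = accesses = 0
    for _ in range(scans):
        for block in range(num_blocks):
            for _ in range(reads_per_block):  # e.g. eight 1KB reads per 8KB block
                accesses += 1
                if block in cache:
                    cache.move_to_end(block)  # mark as most recently used
                    continue
                misses += 1
                if len(cache) >= cache_blocks:
                    # last=True evicts the most recently used block,
                    # last=False evicts the least recently used one.
                    cache.popitem(last=evict_mru)
                cache[block] = True
    return misses / accesses


lru = miss_ratio(1280, 512, 4, 8, evict_mru=False)
same_order = miss_ratio(1280, 512, 4, 8, evict_mru=True)
print(lru, same_order)
```

Under LRU every block is evicted before the next scan reaches it, so only the first of the eight reads to each block misses: 1/8 = 12.5%, matching the figure quoted in the text. Evicting the most recently accessed block instead keeps part of the file resident across scans, so its miss ratio is lower.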


Multi-Process Traces Having seen that appropriate 
user-level policies can really improve the 
cache performance of individual applications, we 
would like to see how our scheme performs in a 
multi-process environment. 


Figure 8: Performance for a Multi-Process Workload 

We collected traces when the three applications 
(Postgres, cscope, linking the kernel) were run concurrently. 
In this trace, we simulated each application 
running its own user-level policy as discussed 
above. The result is shown in Figure 8. Since the applications’ 
user-level policies are not optimal, we also 
simulated the case of each application using an offline 
optimal algorithm as its user-level policy. This 
yields the curve directly above RMIN. 

As can be seen, our scheme, coupled with appro- 
priate user level policies, can improve the hit ratio 
for multiprocess workloads. 

We also performed an experiment to measure the 
benefit of using place-holders. We collected a trace 
of the Postgres and kernel-linking applications run- 
ning concurrently, and simulated the miss ratio of 
Postgres when kernel-linking makes the worst possi- 
ble replacement choices and Postgres simply follows 
LRU. We simulated our full allocation algorithm 
and our algorithm without place-holders. Figure 9 
shows the result. Without place-holders, Postgres 
is noticeably hurt by the other application’s bad re- 
placement decisions. (With place-holders, Postgres 
has the same miss ratio as under global LRU.) 


5.4 Results for Sprite Traces 


Our second set of traces is from the UC Berkeley 
Sprite File System Traces [2]. There are five sets 
of traces, recording about 40 clients’ file activities 
over a period of 48 hours (traces 1, 2 and 3) or 24 
hours (traces 4 and 5). 

Figure 9: Benefit of Using Place-Holders 


We focused on the performance of client caching. 
In a system with a slow network (e.g., Ethernet), 
client caching performance determines the file sys- 
tem performance on each workstation. Furthermore 
we ignored kernel file activities in these traces, be- 
cause Sprite’s VM system swaps to remote files. In 
our simulation we set the client cache size to be 
7MB, which is the average file cache size reported 
in [2]. 

These traces do not contain process-ID infor- 
mation, so we cannot simulate application-specific 
policies as with Ultrix traces. However, since most 
file accesses are sequential [2], the sequential heuris- 
tic can be used. Figure 10 shows average miss ratios 
for global LRU, sequential heuristic and optimal re- 
placement. Average cold-start (compulsory) miss 
ratios are also shown. 

As can be seen, two-level replacement with the sequential 
heuristic improves the hit ratio for some traces. 
In fact, simulations show that sequential improves the 
hit ratio for about 10% of the clients, and the improvements 
in these cases range from 10% to over 
100%. 

Overall, these results show that two-level replace- 
ment is a promising scheme to enable application 
control of replacement and to improve the hit ratio 
of the file buffer cache. We believe that two-level 
replacement should be implemented in future file 
systems. 


Figure 10: Averaged Results from Sprite Traces 


6 Related Work 


There have been many studies on caching in file 
systems (e.g. [24, 5, 23, 3, 18, 20]), but these investigations 
were not primarily concerned with cache 
replacement policies. The database community has 
long studied buffer replacement policies[26, 8, 21], 
but existing file systems do not support them. Although 
user scripts were introduced in caching in 
a disconnected environment[14], they were used to 
tell the file system which files should be cached (on 
disk) when disconnection occurs. In most of these 
systems, the underlying replacement policy is still 
LRU or an approximation to it. 


In the past few years, there has been a stream 
of research papers on mechanisms to implement 
virtual memory paging at user level. The external 
pagers in Mach [29] and V [6] allow users to 
implement paging between local memory and secondary 
storage, but they do not allow users to control 
the page replacement policy. Several studies 
[18, 25, 12, 15] proposed extensions to the external 
pager or improved mechanisms to allow users 
to control page replacement policy. These schemes 
do not provide resource allocation policies that satisfy 
our design principles to guarantee replacement 
performance. Furthermore, they are not concerned 
with file caching. 

Previous research on user-level virtual memory 
page replacement policies [1, 9, 12, 15, 27] 
shows that application-tailored replacement policies 
can improve performance significantly. With certain 
modifications, these user-level policies might 
be used as user-level file caching policies in our 
two-level replacement. Recent work on prefetching 
[22, 10, 7] can be directly applied in user-level 
file caching policies using our general replacement 
strategy. These systems can take advantage of the 
properties of our allocation policy to guarantee performance 
of the entire system. 


7 Conclusions 


This paper has proposed a two-level replacement 
scheme for file cache management, its kernel policy 
for cache block allocation, and several user-level re- 
placement policies. We evaluated these policies us- 
ing trace-driven simulation. 

Our kernel allocation policy for the two-level 
replacement method guarantees performance im- 
provements over the traditional global LRU file 
caching approach. Our method guarantees that 
processes that are unwilling or unable to predict 
their file access patterns will perform at least as 
well as they did under the traditional global LRU 
policy. Our method also guarantees that a pro- 
cess that mis-predicts its file access patterns cannot 
cause other processes to suffer more misses. Our key 
contribution is the guarantee that a good user-level 
policy will improve the file cache hit ratios of the 
entire system. 

We proposed several user-level policies for com- 
mon file access patterns. Our trace-driven simu- 
lation shows that they can improve file cache hit 
ratios significantly. Our simulation of a multipro- 
grammed workload confirms that two-level replace- 
ment indeed improves the file cache hit ratios of the 
entire system. 

We believe that the kernel allocation policy pro- 
posed in this paper can also be applied to other 
instances of two-level management of storage re- 
sources. For example, with small modifications, it 
can be applied to user-level virtual memory man- 
agement. 

Although the kernel allocation policy guaran- 
tees performance improvement over the traditional 
global LRU replacement policy, there is still room 
for improvement. We plan to investigate these pos- 
sible improvements and implement the two-level re- 
placement method to evaluate our approach with 
various workloads. 
Abstract 


Ficus is a flexible replication facility with optimistic 
concurrency control designed to span a wide range 
of scales and network environments. Optimistic con- 
currency control provides rapid local access and high 
availability of files for update in the face of disconnec- 
tion, at the cost of occasional conflicts that are only 
discovered when the system is reconnected. Ficus re- 
liably detects all possible conflicts. Many conflicts 
can be automatically resolved by recognizing the file 
type and understanding the file’s semantics. This pa- 
per describes experiences with conflicts and automatic 
conflict resolution in Ficus. It presents data on the fre- 
quency and character of conflicts in our environment. 
This paper also describes how semantically knowl- 
edgeable resolvers are designed and implemented, and 
discusses our experiences with their strengths and lim- 
itations. We conclude from our experience that opti- 
mistic concurrency works well in at least one realistic 
environment, conflicts are rare, and a large proportion 
of those conflicts that do occur can be automatically 
resolved without human intervention. 


1 Introduction 


The value of file replication is widely recognized, but 
replication of updatable files leads immediately to con- 
sistency problems. File replicas can be partitioned 
from each other for a variety of reasons, ranging from 
failures of machines and networks to intentionally in- 
termittent connections (e.g., connection via modem, or 
replicas on portable machines that are not always at- 
tached to a network). If no efforts are taken, partition- 
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ing can permit conflicting updates to different replicas 
of a file. Much of the value of replication is based on 
all replicas being identical, so inconsistent updates are 
a potentially serious problem. 

Early solutions to the problem relied on various con- 
servative algorithms that prevented conflicting updates 
to different replicas [1]. These solutions used a wide 
variety of mechanisms, but their common theme is that 
they refuse updates that have any possibility of causing 
conflicting updates. These solutions trade availability 
for consistency. When consistency of replicas is of vi- 
tal importance, conservative solutions are preferable. 

However, experience with file access by typical users 
has shown that many files are only accessed by a single 
user [10]. Of those that are shared by multiple users, 
few are updated by more than one user. In such en- 
vironments, a mechanism that prevents one user from 
updating a file in favor of preserving the update ability 
of other users who might never generate an update is se- 
riously flawed. Conservative replication mechanisms 
exhibit this flaw. 

Optimistic replication mechanisms do not. They al- 
low any replica of a file to be updated at any time. 
This choice ensures that users who need to update a 
file can do so when any replica is available. However, 
optimistic mechanisms gain this availability by trad- 
ing off consistency. Since any replica can be updated, 
two non-communicating replicas can be changed in- 
dependently, leading to conflicts, i.e., different replica 
contents. To maintain consistency, a system with op- 
timistic replication must detect and recover from such 
conflicts. 

The improved availability of optimistic systems 
must be weighed against the frequency and cost of re- 
covering from conflicts. A hypothesis of this paper is 
that the cost of optimism is low in many environments. 

To test this hypothesis, this paper reports conflict 
resolution experiences with Ficus, an optimistic file 
system developed at UCLA [4]. Ficus has supported 
the primary computing needs of over a dozen users at 
UCLA for more than three years. 

Ficus has a general architecture for dealing with file 
conflicts. Conflicts are automatically detected and ex- 
amined to determine if they can also be resolved au- 
tomatically. Special programs called resolvers handle 
conflicts that can be dealt with automatically. 

Ficus is able to resolve almost all conflicts for a par- 
ticularly important class of files—directories. Ficus 
supports a Unix-style directory system; the semantics 
of Unix directories provide that almost every conflict 
that can occur in them can be automatically resolved. 
Directory conflicts could be resolved by submitting 
them to a resolver that implements the algorithms nec- 
essary to resolve their conflicts, but the integrity of the 
Unix file system is so closely linked to its directories 
that we have chosen to put the algorithms into Ficus 
itself. 

We have instrumented the Ficus system to keep 
track of the number of conflicts generated and how 
many conflicts were resolved automatically. This pa- 
per presents the statistics gathered, which support the 
contention that, for important patterns of usage, the fre- 
quency of file conflicts is low enough that optimistic 
replication is highly attractive. Further, the statistics 
demonstrate that the use of automatic resolvers is both 
practical and important to reduce the number of con- 
flicts reported to users. 

The next section presents an overview of the Ficus 
file system. We begin there with an example of how 
a conflict can arise in an optimistic replication system, 
discuss the different kinds of conflicts that can arise, 
briefly describe how conflicts are detected, and cover 
other relevant aspects of Ficus. Section 3 describes 
the Ficus resolution architecture. It discusses the au- 
tomated directory conflict resolution mechanisms in 
Ficus and describes how Ficus handles other types of 
conflicts. Section 4 discusses Ficus conflict resolver 
programs. It covers their interface and the various ap- 
proaches used to resolve conflicts for different types 
of files. Section 5 presents conflict data gathered from 
Ficus; Section 6 discusses some related research. We 
close with a discussion of future work and some con- 
clusions. 


2 Ficus Overview 


Ficus is a distributed file system utilizing optimistic 
replication [16, 4]. The default synchronization policy 
provides single copy availability; so long as any copy 
of a data item is accessible, it may be updated. Once 
a single replica has been updated, the system makes a 
best effort to notify all accessible replicas that a new 
version of the file exists via update propagation. Those 


replicas then pull over the new data. Ficus guarantees 
no lost update semantics despite this optimistic con- 
currency control. Conflicting updates are guaranteed 
to be detected, allowing recovery after the fact. 

Ficus groups subtrees of files into volumes. A vol- 
ume can be replicated multiple times. A background 
process known as reconciliation runs on behalf of each 
volume replica after each reboot and periodically dur- 
ing normal operation. It compares all files and directo- 
ries of the local volume replica with a remote replica of 
the volume, pulling over missed updates and detecting 
concurrent update conflicts. 

Several types of conflicts are possible. They include: 


• Update/update conflicts 
• Name conflicts 
• Remove/update conflicts 


The remainder of this section will discuss these types 
of conflicts in more detail. The next section describes 
how we manage these conflicts. 

Since single copy availability permits any replica to 
be updated, even a simple partitioning of a two-replica 
file can result in a conflict. Figure 1 illustrates this 
situation. File foo has two replicas in Figure 1a, with 
replica 1 at site A and replica 2 at site B. If sites A 
and B are partitioned, as Figure 1b shows, updates to 
both replicas are accepted. Then, when the partition 
is merged, as shown in Figure 1c, file foo exists in two 
versions. This is an update/update conflict. 

Directories provide a special case of update/update 
conflicts. Partitioned creation of independent files in 
the same directory would ordinarily result in an up- 
date/update conflict on that directory. Since directo- 
ries are internal to the file system, Ficus automatically 
resolves this sort of concurrent update, producing the 
union of all directory changes. (See [6] for a description 
of the algorithms employed in directory management.) 
A problem occurs when two files are independently 
created with the same name; Unix requires that 
each directory entry be unique. We term this kind of 
directory update/update conflict a name conflict. 
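
As a rough illustration of the idea, and not of the Ficus algorithms themselves (those are described in the cited references), the sketch below merges two partitioned directory replicas that have only had entries added. Everything in it is an assumption made for the example: each replica is modelled as a Python dict from name to a file identifier, `merge_directories` is a hypothetical name, and removals are deliberately ignored.

```python
def merge_directories(replica_a, replica_b):
    """Union two directory replicas; report names bound to different files."""
    merged = {}
    name_conflicts = []
    for name in sorted(set(replica_a) | set(replica_b)):
        # Collect the file ids that each replica binds to this name.
        ids = {replica[name] for replica in (replica_a, replica_b)
               if name in replica}
        if len(ids) == 1:
            merged[name] = ids.pop()     # same file, or created on one side only
        else:
            name_conflicts.append(name)  # same name, different files
    return merged, name_conflicts
```

Entries created independently under different names simply appear in the union, while a name bound to two different files is reported as a name conflict and left out of the merged result.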

Figure 1: An update/update conflict. 


Figure 2 illustrates a different kind of conflict. In Figure 
2a, we see two replicas of file foo before a partition. 
In 2b, file foo is removed at site B (indicated by the 
shading of site B’s replica), while the partitioned replica 
at site A is updated. When the partition merges, as 
shown in 2c, if no update had occurred, then the other 
replica should simply be removed. However, if the 
updated replica is removed in this situation, the update 
generated during the partition is lost, possibly without 
the knowledge of the person making the update. Ficus’ 
“no lost update semantics” requires that the update 
generated at site A not be discarded as a result of the 
removal of the file at site B. This kind of conflict is 
called a remove/update conflict. 


If a file is independently deleted from two replicas 
of a partitioned directory, Ficus does not log a conflict. 
This delete/delete situation is not a problem, provided 
any other replicas of the file are also deleted when the 
partitions merge, since both deletions have precisely 
the same effect. 


Ficus makes a “best-effort” attempt to propagate up- 
dates as they occur. However, even when no partition- 
ings or other machine failures happen, update propa- 
gation is not guaranteed. Thus, conflicting updates can 
arise even without machine or network failures. Also, 
Ficus does not lock replicas for update even within a 
partition, so two replicas can accept simultaneous up- 
dates to a file that could result in a conflict. In practice, 
the update propagation mechanism is fast and reliable 
enough that conflicts unrelated to actual failures or par- 
titioning almost never occur. 

Ficus detects all types of conflicts using a mecha- 
nism known as a version vector [14]. Each file replica 
maintains its own version vector that keeps track of the 
history of updates to the file. Conflicts are detected 
by comparing version vectors from two file replicas. 
Version vectors reliably detect all file conflicts that in- 
volve replicas of a single file. They do not assist in 
ensuring the consistency of updates that span multiple 
files. Other mechanisms (not supported in Ficus, nor 
in most file systems) are required to do so. 
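
As a rough illustration of the comparison, the sketch below treats a version vector as a map from replica identifier to the number of updates that replica has originated. The representation and the names compare, site_a, and site_b are assumptions made for this example, not the Ficus data structures.

```python
from typing import Dict

# Hypothetical representation: replica id -> number of updates made there.
VersionVector = Dict[str, int]


def compare(a: VersionVector, b: VersionVector) -> str:
    """Return 'equal', 'a_dominates', 'b_dominates', or 'conflict'."""
    replicas = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in replicas)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "conflict"      # each side has seen an update the other has not
    if a_ahead:
        return "a_dominates"   # b is simply out of date
    if b_ahead:
        return "b_dominates"   # a is simply out of date
    return "equal"


# Both sites start from the same version, then update independently.
site_a = {"A": 2, "B": 1}
site_b = {"A": 1, "B": 2}
print(compare(site_a, site_b))
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))
```

When neither vector dominates, the two replicas were updated independently, which is the situation reported as a conflict.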


3 Ficus Conflict Resolution Architecture 


Several types of conflicts are possible in Ficus. Be- 
cause of the importance of the integrity of directo- 
ries, directory conflicts receive special handling. Re- 
move/update conflicts also require some special treat- 
ment. Update/update conflicts on non-directory files 
are the most common case. The following subsections 
discuss each type of conflict and its handling in more 
detail. 
Figure 2: A remove/update conflict. (Panels: Fig. 2b, during partition; Fig. 2c, after partition.)


3.1 Directory Conflicts 


The integrity of a Unix file system depends on its di- 
rectories. If a directory cannot be used because it has 
received conflicting updates, a portion of the file sys- 
tem’s name space may become inaccessible. Thus, 
conflicts in directories are very serious. Either they 
must not occur often, or they must be resolved auto- 
matically. 


Ficus directory conflicts are repaired automatically 
during reconciliation. As shown in [4, 3], all conflicts 
that can occur in a Unix directory can be automatically 
resolved, except for name conflicts. A complete de- 
scription of directory reconciliation algorithms is avail- 
able in these references, so we discuss only their broad 
outlines here, as an example of how known semantics 
of files can be used to resolve conflicts. 


Unix directories support only two operations: a pro- 
cess can add a name to the directory or remove a name 
from a directory. Creation of a file adds a name to a 
directory (in addition to creating data structures to represent
the file itself). Creating a hard link to a file adds
a second name for the file. File contents are discarded
only when the last name for the file has been removed. 
While mechanically rename is an atomic operation in 
many Unix systems, semantically it can be treated as a 
remove followed by a create. 

A number of issues, such as handling arbitrary pat- 
terns of failures and recoveries, distinguishing between 
creation and deletion of entries, and avoiding central- 
ized algorithms, make the problem of directory man- 
agement substantially more complex than it seems at 
first glance. In broadest principle, the automatic recon- 
ciliation mechanisms for directories examine all entries 
in both versions of the directory in conflict, determine 
which entries are common to both, and, for an entry 
that is present in only one directory, determine whether 
the file was created or deleted during partition. Ficus 
keeps sufficient information to distinguish precisely the 
patterns of file entry additions and deletions while par- 
titioned, which in turn allows all possibly conflicting 
updates to be addressed. 
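
In outline, and assuming (hypothetically) that each replica records both its live entries and the entries it has removed since the last reconciliation, the merge can be sketched as follows. The live and dead sets and the pairing of a name with a file identifier are illustrative assumptions; the actual algorithms are those described in [4, 3].

```python
from typing import Dict, Set, Tuple

Entry = Tuple[str, str]  # (name, file identifier); both hypothetical


def reconcile(live_a: Set[Entry], dead_a: Set[Entry],
              live_b: Set[Entry], dead_b: Set[Entry]) -> Set[Entry]:
    """An entry survives if either side has it and neither side removed it."""
    removed = dead_a | dead_b
    return (live_a | live_b) - removed


def name_conflicts(entries: Set[Entry]) -> Dict[str, Set[str]]:
    """Names that now refer to more than one distinct file."""
    by_name: Dict[str, Set[str]] = {}
    for name, file_id in entries:
        by_name.setdefault(name, set()).add(file_id)
    return {n: ids for n, ids in by_name.items() if len(ids) > 1}


# Before the partition both replicas held: notes (f1) and todo (f2).
live_a = {("notes", "f1"), ("draft", "f3")}   # A removed todo, created draft
dead_a = {("todo", "f2")}
live_b = {("notes", "f1"), ("todo", "f2"), ("draft", "f4"), ("plan", "f5")}
dead_b: Set[Entry] = set()

merged = reconcile(live_a, dead_a, live_b, dead_b)
print(sorted(merged))
print(name_conflicts(merged))
```

The removal log is what lets the merge tell "deleted at A" apart from "created at B" for an entry present on only one side; the two different files named draft are left for a name conflict resolver.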

One class of conflicting updates can create another 
problem, however. Concurrent creation of different
files with the same name results in a directory with two
identical, effectively indistinguishable names. Ficus 
detects that the two files are actually different, but the 
Unix directory model does not permit different files to 
have identical names, so some action is required. Fi- 
cus appends unique suffixes to each name and invokes 
name conflict resolvers to handle the situation. Like 
other conflict resolvers, if automatic resolution fails, a 
default resolver notifies the file owner, who must either 
rename or remove one of the files. 


3.2 Remove/Update Conflicts 


Remove/update conflicts are handled specially. Ficus 
is able to recognize such cases, again using version 
vectors. Ficus’ “no lost update” semantics requires 
that a remove/update conflict not result in the loss of
the update. On the other hand, all names for the data 
have been removed, so Ficus should not permit the file 
to remain available via those names. Ficus’ solution 
to this problem is to move the file into a special direc- 
tory called an orphanage. Each volume has its own 
orphanage directory located under its root directory. 
When the reconciliation process moves a file into an 
orphanage, electronic mail is sent to the owner notify- 
ing him, allowing him to decide whether to keep the 
updated file or discard it. 


3.3 Update/Update Conflicts 


Some update/update conflicts for non-directory files 
can be resolved automatically and some cannot, de- 
pending on the semantics associated with the file. Ficus 
has the ability to invoke various resolvers to attempt 
to handle file conflicts. Ficus allows individual users 
to specify how they would like their conflicts resolved, 
but also provides a default system for resolving con- 
flicts when users have not specified their own methods, 
or when the user’s methods fail. 


3.4 Conflict Resolution 


The Ficus reconciliation process runs through the files 
in two replicas of a volume, examining each file to de- 
termine if it has been modified since the last successful 
reconciliation. When it discovers conflicting versions 
of those file replicas, the reconciliation process marks 
the file as “in conflict.” After marking the file, reconciliation
invokes resolvers to attempt to fix the conflict.
As long as the file is in conflict, normal operations on 
the file under its usual name will fail. Each replica of 
the file can be accessed by special mechanisms, and 
Ficus provides a tool to clear the conflict. Ficus resolver
programs must use these capabilities and tools
along with semantic knowledge of particular file types
to resolve conflicts.


\.newsrc$        /bin/newsrc_resolver
fortunes\.dat    /bin/fortune_resolver
man/cat[1-9]     /bin/man_resolver
\.history        /bin/arbitrary_resolver
\.bash_history   /bin/arbitrary_resolver
\.pl$            /bin/prolog_resolver
.*               /bin/default_resolver


Figure 3: A resolver selection file. The left column is
the pattern to match against the conflicted file’s pathname.
The right column is the resolver that is invoked
in an attempt to fix the conflict.


Ficus selects a resolver to use for a particular con- 
flicted file by searching a personal and a system-wide 
resolver list. Entries in the system-wide list include re- 
solvers for common file types. Personal resolver lists 
specify resolvers for file types unique to an individ- 
ual. Personal resolver lists also allow an individual to 
choose between safety and convenience by optionally 
enabling resolvers that don’t preserve all data. For ex- 
ample, some users don’t care about conflicts on backup 
files left from editors and so have a resolver arbitrarily 
select one of conflicting backup files. More conserva- 
tive users may wish to have data in backup files com- 
pletely preserved and so may invoke a resolver saving 
each replica of the backup file as separate files. 

Conflict resolvers almost always require knowledge 
about the type of the file being resolved. Since Unix 
systems do not provide a typed file system, Ficus infers 
file types from file names and from type-recognition 
programs that examine file contents and attributes. 

The reconciliation process that examines the resolver 
lists matches only on file-name-to-regular-expression 
comparisons. Because regular expressions are used, 
matches can be exact or on substrings. Figure 3 shows 
a portion of a resolver file. Whenever a conflicted file
named .newsrc or a file whose pathname contains
man/cat is encountered, the newsrc or manual-page
resolver, respectively, is invoked. All resolvers shown in this
figure have been implemented with the exception of 
the prolog_resolver. 

Unfortunately, simple file name comparison cannot 
reliably identify all file types. For example, the file 
csh.1 might be a manual page in certain contexts and a
shell script in others. To support more sophisticated
file type identification, a resolver list might use more 
intelligent programs to check the file type. Continuing 
with the example in Figure 3, the prolog_resolver 
could abort if invoked to resolve a Perl program (whose 
files end with the same extension), by recognizing that
certain constructs in the file were probably not legal
Prolog. If a resolver aborts, other resolvers are invoked
in turn until one succeeds. 
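
A minimal sketch of this selection loop is given below. It assumes a table of (pattern, resolver name, callable) triples and a resolver that signals an abort by returning False; the patterns, the stand-in resolver functions, and the name resolve are hypothetical and only mirror the kind of entries a resolver list contains.

```python
import re
from typing import Callable, List, Optional, Tuple

Resolver = Callable[[str], bool]  # True on success, False to abort


def always_succeeds(path: str) -> bool:
    return True


def prolog_resolver(path: str) -> bool:
    # Stand-in: pretend the contents were inspected and were not Prolog.
    return False


# Hypothetical resolver list, searched from top to bottom.
TABLE: List[Tuple[str, str, Resolver]] = [
    (r"\.newsrc$", "newsrc_resolver", always_succeeds),
    (r"man/cat[1-9]", "man_resolver", always_succeeds),
    (r"\.pl$", "prolog_resolver", prolog_resolver),
    (r".*", "default_resolver", always_succeeds),
]


def resolve(path: str, table: List[Tuple[str, str, Resolver]] = TABLE) -> Optional[str]:
    """Run matching resolvers in order; return the name of the first to succeed."""
    for pattern, name, run in table:
        # re.search matches anywhere in the pathname, so substrings match too.
        if re.search(pattern, path) and run(path):
            return name
    return None


print(resolve("/home/user/.newsrc"))
print(resolve("/usr/man/cat1/csh.1"))
print(resolve("/home/user/report.pl"))
```

Because the catch-all pattern comes last, a file whose specific resolver aborts still falls through to the default resolver.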

Separate resolver lists are provided for name con- 
flicts and file conflicts. In retrospect, the data-specific 
actions taken in the case of a name conflict and a file 
conflict are often quite similar; only details about re- 
solving the conflict differ. However, in certain cases it 
is important for a resolver to know whether it is deal- 
ing with a name conflict or an update/update conflict. 
In the future we plan to merge the different resolver 
lists and specify the conflict type as an argument to the 
resolver. 

Ficus allows files to be replicated any number of 
times, so it is possible that a given file might have 
three or more conflicting replicas. The Ficus reconcili- 
ation mechanism works on only two replicas at a time, 
though, so the conflicting replicas will be dealt with 
in a pairwise fashion. This simplifies the writing of 
resolvers, since they need only deal with the common 
case of exactly two conflicting replicas, rather than 
an arbitrary number. All conflicts involving multiple 
replicas can be regarded as multiple pairwise conflicts, 
so no power is lost. Also, often not all of the con- 
flicting replicas are simultaneously available. Since 
reconciliation runs between two sites known to be in 
communication, at least the pair of replicas they store 
are guaranteed to be available. 

How resolvers fix conflicts depends on the seman- 
tics of the file in question. Section 4 discusses a variety 
of the existing Ficus conflict resolvers. Typically, re- 
solvers read the data contained in both versions of the 
file, update one version of the file on the basis of both 
versions, then update the version vector of that replica 
to dominate the other, clearing the conflict. 
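
How the version vector is made to dominate is not spelled out here. One common way to do it with version vectors (an assumption, not necessarily what Ficus does) is to take the element-wise maximum of the two vectors and then increment the resolving replica's own component. The function name dominating_vector is hypothetical.

```python
from typing import Dict

VersionVector = Dict[str, int]  # replica id -> update count (assumed layout)


def dominating_vector(mine: VersionVector, other: VersionVector,
                      my_id: str) -> VersionVector:
    """Build a vector that is at least as large as both inputs everywhere."""
    merged = {r: max(mine.get(r, 0), other.get(r, 0))
              for r in set(mine) | set(other)}
    # The resolution itself counts as one more update at the resolving replica.
    merged[my_id] = merged.get(my_id, 0) + 1
    return merged


resolved = dominating_vector({"A": 2, "B": 1}, {"A": 1, "B": 2}, "A")
print(sorted(resolved.items()))
```

After this step the other replica's vector is strictly older, so the next reconciliation propagates the resolved contents instead of reporting the conflict again.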


4 Conflict Resolution Strategies and 
Examples 


Experience with the Ficus conflict resolution mecha- 
nism has shown that there are broad classes of file con- 
flicts that can be automatically resolved. This section 
discusses them, presenting examples of each. We make 
no claim that the list is exhaustive—in fact, we are 
sure it is not, since it simply demonstrates the classes 
of conflicts that have occurred frequently enough in 
our environment to draw the attention of conflict re- 
solver writers. Further investigation, especially work 
in different computing environments, will undoubtedly 
reveal other classes of file conflicts amenable to auto- 
matic resolution. 

In several important cases, much of the potential
work of resolving conflicts is done by Ficus itself. As
mentioned, Ficus resolves most directory conflicts au- 
tomatically. Thus, any application that makes substan- 
tial use of the Unix directory structure has much of its 
conflict resolution problem automatically solved. Two 
important examples are Ficus graft points and MH mail 
directories. 

Ficus divides its file space into volumes, each of 
which is connected to a single place in the file hierar- 
chy. That place is called the volume’s graft point, sim- 
ilar to a Unix mount point. The graft point must keep 
information about all the replicas of the volume, includ- 
ing each replica’s storage site and other bookkeeping 
information. Since Ficus uses a Unix directory with 
one directory entry per graft point entry, graft point 
conflicts can be resolved automatically. For instance, 
if a new replica is added on each side of a partition, 
when the partition merges the graft point will automat- 
ically be resolved to indicate that both new replicas are 
available. Graft points do not experience name con- 
flicts because the tools that update them never generate 
identical graft point entries. 

The MH mail application also makes substantial use 
of directories. In MH, messages are organized into 
folders, which are implemented as directories. Most of 
the conflicts that could occur to MH folders during a 
partition are thus resolved automatically. For example, 
if a user re-files mail messages into different folders on 
both sides of a partition, the Ficus directory conflict res- 
olution mechanism would handle most of the resulting 
conflicts. Only name conflicts occasionally caused by 
re-filing messages into the same position in two repli- 
cas of a given folder require user attention. If numeric 
identity of messages is not considered important, even 
these conflicts can be automatically resolved. 

Another type of conflict that Ficus resolves auto- 
matically is conflicts on files whose contents can be 
automatically reconstructed. Control files used by the 
MH mail system are an example. These files maintain 
sequence and context mechanisms. They can be, and often
are, built as needed by MH. The only requirement
is that MH generally expects something to be there—it 
is not prepared to deal with a totally absent file, though 
it can deal with a file that does not contain very useful 
information. Thus, to resolve conflicts on these files, 
the file contents are truncated. The next time MH is 
run, the file will be reconstructed with a default context. 

Many types of files are not important to most users. 
For example, many users do not care about core files 
produced by Unix processes that fail, or about backup 
files produced by some Unix programs. Users who do 
not care about such files can put lines in their personal 
resolver files that either delete all such files when they 
get in conflict, or choose one of the conflicting replicas
arbitrarily, or choose the replica with the later date.
However, since some users do not want to lose some of 
their core or backup files without their knowledge, the 
system resolver file does not impose these decisions on 
users. 

Some files are monotonically increasing logs of in- 
formation. An example is the .newsrc file listing 
what newsgroups and articles have been read. The 
message numbers listed as read in each newsgroup 
usually increase monotonically. In the case of truly 
monotonically increasing logs, resolving conflicts is 
simple. The post-resolution version of the conflicting 
file simply contains the high water mark for each entry. 
If the file keeps exhaustive lists of items, the resolved 
version merges items from both conflicting versions. 

In the particular case of .newsrc files, the situ- 
ation is a little more subtle. The semantics of what 
can be changed in a .newsrc file is a bit richer than 
simply updating a record of articles seen. The user 
can subscribe or unsubscribe to newsgroups, for exam- 
ple. Some of these actions remove information from 
the file, making perfect conflict resolution impossi- 
ble. The Ficus conflict resolver for .newsrc files 
thus must make some choices. It generally errs on the 
side of information preservation, presenting users with 
more news rather than with less. For instance, if one 
version of a .newsrc file indicates that a newsgroup is
not subscribed to, and the other version indicates that 
it is, the conflict resolver subscribes the user to that 
newsgroup. The user can easily unsubscribe again, if 
that was truly what he wanted to do. If the system 
had left him unsubscribed when he had just recently 
subscribed, however, the user might not notice that his 
subscription had been invisibly revoked. This is an ex- 
ample of the create/delete ambiguity described in [5]. 
In some cases, taking one possible action and reporting 
the action taken to the user may be sufficient. 
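
The policy just described can be sketched as follows, assuming the conventional .newsrc line layout (a group name, then ':' for subscribed or '!' for unsubscribed, then a list of article ranges). The parsing rules and the names parse and merge are assumptions for illustration; the real resolver's handling of edge cases is not given here.

```python
from typing import Dict, Set, Tuple

Groups = Dict[str, Tuple[bool, Set[int]]]  # name -> (subscribed, articles read)


def parse(text: str) -> Groups:
    """Parse lines such as 'comp.lang.c: 1-10,12' or 'rec.games! 1-5'."""
    groups: Groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # ':' marks a subscribed group, '!' an unsubscribed one.
        sep = ":" if ":" in line else "!"
        name, _, ranges = line.partition(sep)
        read: Set[int] = set()
        for part in ranges.replace(" ", "").split(","):
            if part:
                lo, _, hi = part.partition("-")
                read.update(range(int(lo), int(hi or lo) + 1))
        groups[name] = (sep == ":", read)
    return groups


def merge(a: str, b: str) -> Groups:
    pa, pb = parse(a), parse(b)
    merged: Groups = {}
    for name in set(pa) | set(pb):
        sub_a, read_a = pa.get(name, (False, set()))
        sub_b, read_b = pb.get(name, (False, set()))
        # Subscribe if either version does; keep every article marked read.
        merged[name] = (sub_a or sub_b, read_a | read_b)
    return merged


home = "comp.lang.c: 1-10\nrec.games! 1-5\n"
office = "comp.lang.c: 1-7,12\nrec.games: 1-3\n"
result = merge(home, office)
print(result["rec.games"][0])
```

Taking the union of the read lists corresponds to merging a monotonically growing log, while the "subscribed if either side says so" rule is the choice that errs toward showing the user more news.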

The .newsrc resolver shows a typical character- 
istic of many resolvers. Often, it is relatively easy to 
produce a resolver that is right the great majority of the 
time, but occasionally makes a mistake. Producing one 
that is right all of the time, on the other hand, may be 
very difficult, or even impossible. A reasonable strat- 
egy in such cases is to write a resolver anyway, as long 
as the resolver can do something in the tricky cases 
that will not produce disastrous results. If the results 
are merely inconvenient in the rare cases when they’re 
not necessarily right, then the resolver has solved the 
conflict correctly most of the time, and caused little 
more trouble than not solving it at all the rest of the 
time. Since the alternative to this choice is notifying 
the user to solve it himself, this approach is attractive. 

In some cases, the semantics of a file are quite simple.
Score files for some of the popular Unix games are
one such case. These files typically keep the top scores
in sorted order. Ficus has conflict resolvers for many 
such games that sort and merge the two conflicting 
versions, removing duplicates. The case of game score 
files does bring up a difficulty with writing general re- 
solvers, however. While each of the game score files 
contains substantially the same kind of information, the 
actual format is sufficiently different that writing a sin- 
gle resolver to handle all of them is difficult. Instead, 
Ficus has a class of very similar resolvers to deal with 
the peculiarities of each. 
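
For one hypothetical score-file format (one "score name" pair per line, highest score first), a sort-and-merge resolver might look like the sketch below. The format, the keep limit, and the name merge_scores are assumptions; as noted above, each real game needs its own variant.

```python
def merge_scores(a: str, b: str, keep: int = 10) -> str:
    """Merge two score files, dropping duplicates and keeping the top entries."""
    entries = set()
    for line in (a + "\n" + b).splitlines():
        line = line.strip()
        if line:
            score, name = line.split(maxsplit=1)
            entries.add((int(score), name))  # the set removes exact duplicates
    # Highest score first; ties broken by name so the output is deterministic.
    ranked = sorted(entries, key=lambda e: (-e[0], e[1]))
    return "\n".join(f"{score} {name}" for score, name in ranked[:keep])


home = "900 alice\n500 bob\n"
office = "900 alice\n700 carol\n"
print(merge_scores(home, office))
```

Only the parsing line would typically change from game to game, which is why a family of near-identical resolvers is a natural outcome.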

In other cases, conflicts can be solved simply by 
merging the two versions of a file into one, preserving 
all data in both. Doing so may cause some data to be 
duplicated, but many programs are able to handle such 
duplications without problems. One such case is the 
xcal program, an interactive window-based calendar 
manager. Conflicted xcal data files can be resolved 
simply by concatenating the two versions into one. The 
Ficus resolver includes a comment line indicating what 
happened, should the user care to clean up further, but 
the xcal program can go ahead and work with the 
merged version. 

When the Ficus resolver files cannot resolve a con- 
flict themselves, they call a final resolver (called the 
generic resolver) that notifies the user via electronic 
mail that a conflict occurred. The conflict is left unre- 
solved until the user gives it personal attention. 

In the UCLA environment, every replica of a vol- 
ume is reconciled with another replica every hour. In 
the case of unresolvable conflicts, users could be bombarded
with hourly messages about the same unresolved conflicts.
If a user did not log in
over a weekend, fifty or more messages could accumu- 
late in his mailbox telling him about a single conflict. 

This problem is prevented by keeping track of unre- 
solved conflicts that have been brought to the owner’s 
attention already. When the generic resolver sends out 
a message about a conflict, it also logs the conflict in 
a per-volume conflict log file. The next time the re- 
solver notices this conflict, it also reads the entry in 
the conflict log and determines that it need not send 
out another message to the user. The conflict log suc- 
cessfully limits the number of conflict report messages 
users receive. 

However, since the conflict log is replicated in its 
volume (for very good reasons), this log itself can expe- 
rience conflicts. Therefore, the conflict log itself needs 
a conflict resolver. This resolver is another example 
of how one can easily write a resolution mechanism 
that is correct almost all of the time, even though it 
is hard to write one that is always absolutely correct. 
The conflict log resolver must make sure that a given 
conflict is reported only once, but also that all conflicts 
are reported. It does so by reading both versions and
writing a new version that contains all lines in either 
conflicted version. In case of any problems that can- 
not be automatically solved, the conflict log resolver 
simply removes both conflicting versions. Should that 
happen, the next time a conflict is detected a new con- 
flict log will be created. The user will receive another 
message for each unresolved conflict in the volume, but 
no conflicts go unreported and only one extra message 
per conflict is sent. 
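
The union-of-lines merge, together with the check that suppresses repeat notifications, can be sketched as follows. The one-conflict-per-line log format and the names merge_logs and should_notify are assumptions made for this example.

```python
def merge_logs(a: str, b: str) -> str:
    """Return a log containing every line found in either version, once."""
    seen = set()
    merged = []
    for line in a.splitlines() + b.splitlines():
        if line and line not in seen:
            seen.add(line)
            merged.append(line)
    return "\n".join(merged) + ("\n" if merged else "")


def should_notify(conflict_id: str, log: str) -> bool:
    """Send mail only for conflicts that are not already logged."""
    return conflict_id not in log.splitlines()


log_a = "volume1/src/main.c\nvolume1/notes.txt\n"
log_b = "volume1/notes.txt\nvolume1/todo\n"
merged_log = merge_logs(log_a, log_b)
print(merged_log, end="")
print(should_notify("volume1/todo", merged_log))
print(should_notify("volume1/new_file", merged_log))
```

If the merge ever fails, starting again from an empty log costs at most one repeated message per outstanding conflict, matching the fallback described above.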

Most of the existing Ficus conflict resolvers are writ- 
ten in Perl. Some are written in C, and some in other 
scripting languages. Generally, resolvers can be writ- 
ten in any language, provided they accept the param- 
eters that the reconciliation process passes to them, 
and they return a value indicating success or failure. 
So far, the processing a typical resolver must per- 
form has proven particularly suitable to Perl, in that 
resolvers frequently perform pattern matching, sort- 
ing, and merging, all functions that are provided by 
Perl. In most cases, the files in question are small, 
so the greater processing speed C could provide is not 
important. 

There are undoubtedly many other types of files 
whose conflicts can be resolved automatically. Our 
approach is to first write resolvers for several known 
problem areas, then to write resolvers for conflicts
that actually come up in practice. Thus, the set of re- 
solvers used at UCLA gradually grows as new types 
of files generate conflicts and people tire of solving the 
conflicts by hand. At the moment, we have about 15 
different resolvers, some of which are used to resolve 
multiple file types. 


5 Data on Conflict Occurrence and 
Resolution 


The Ficus file system has been running as the primary 
development environment for Ficus itself for several 
years. Recently, we began to gather data about the oc- 
currence of conflicts in Ficus. This data was gathered 
by logging every conflict detected by the reconciliation 
processes and tracking conflict resolution. In addition 
to recording conflict detection and resolution, we also 
recorded the total number of updates made to deter- 
mine the relative frequency of conflicting updates in 
our environment. 

One shortcoming of this data is that most indepen- 
dent directory updates are not detected by this instru- 
mentation. We detect all name conflicts, but do not 
detect the much more common case of independent 
creation of two differently named files. Such update 
“conflicts” are automatically resolved by Ficus directory
resolution algorithms. We know that many such
cases have arisen and have required this automatic res- 
olution code. For example, many programs create tem- 
porary files in a user’s home directory. Such programs 
would have created many conflicting directory updates 
between home-use and office machines were it not for 
automatic directory reconciliation. Unfortunately, this 
sort of conflict is not represented in our statistics, and 
we currently cannot precisely estimate the frequency 
of this situation. 

The nature of the environment has a strong influ- 
ence on how often conflicting updates will occur. An 
environment in which almost all the machines are con- 
nected almost all the time will generate relatively few 
conflicts. An environment in which some machines 
are often disconnected will generate more conflicts. 

The UCLA environment contains approximately a 
dozen Sun workstations, each with a regular user, shar- 
ing a replicated namespace over an Ethernet. The net- 
work connection rarely fails. However, since Ficus is 
an experimental file system built into the kernel and 
undergoing continual change, the machines running it 
crash or are voluntarily rebooted much more often than 
most workstations. Machines going up and down ef- 
fectively create partitions as easily as network failures 
do. 

Two of our workstations are located at project mem- 
bers’ homes and are only rarely connected to the net- 
work. These primarily disconnected machines store 
replicas of volumes important to their users. These ma- 
chines and their volumes communicate with the core 
Ficus hosts only rarely and only to exchange updates 
via reconciliation. Although one might expect this 
pattern of usage to result in very high conflict rates, 
surprisingly it does not. One reason is that replica
reconciliation is scheduled to coordinate the movement
of the users with the data of the system, effectively 
allowing the system’s user to act as a human “write 
token” [7]. While this behavior avoids many conflicts, 
nevertheless the conflict rate in mostly disconnected 
volumes is much higher than in other volumes. 

Table 1 shows the conflict statistics for more than 
nine months of operation in the UCLA environment. 
About 0.0035% of all non-directory updates resulted 
in conflicts. 

During the period under measurement, several conflict
resolvers were added to our suite. Using the resolvers
available at the end of the measurement period, 162 of
the update/update conflicts experienced (roughly one in
three) could have been resolved automatically if the same
patterns of conflicts occurred today as did during this
nine-month period. This set of resolvable conflicts
includes files related to the MH mail system, shell
history and editor backup files, .newsrc files, and
several types of game score files.


14,142,241  total non-directory updates
14,141,752    non-conflicting updates
       489    update/update conflicts
       162      automatically resolved
       176      resolvable automatically
       151      not clearly resolvable automatically

        98  update/remove conflicts
        98    passed to the user for resolution

   708,780  name creations
   708,652    non-conflicting name creations
       128    name conflicts
       128      passed to the user for resolution


Table 1: Conflict statistics for a nine-month period.
Theoretically resolvable conflicts are conflicts on files
with semantics amenable to automatic resolution but
for which we have not yet written resolvers.

Many of the other conflicts experienced could have 
been handled by resolvers that have not been written. 
We found 176 of those, more than another third of the 
total number of conflicts. These include control files 
for the trn news reader, saved news postings, manual 
pages, compiler-produced object files, measurement 
statistics files, and score files for other games. 

The remaining 151 conflicts would not be easy to resolve
automatically with our current system. Files that
occasionally got into conflicts of this kind include
source code and arbitrary text files. A significant number
of these conflicts occurred in files placed in orphanages.
Such files no longer
possess their original names. Since most of our exist- 
ing resolvers are selected solely by name, our current 
system has little hope of finding a proper resolver for 
these files. Explicit storage (or identification) of file 
type would make resolution of these files possible. 

None of the name conflicts were resolved automat- 
ically. Because many fewer name conflicts occurred 
than file conflicts, we did not develop any name con- 
flict resolvers in the sample period. About 0.018% of 
all name creations led to name conflicts, all of which 
were resolved by human users. On average, each
user had to deal with about ten name conflicts during
this nine-month period.

Taken as a whole, the average user in this environ- 
ment thus had to resolve about five conflicts a month, 
and examine one remove/update conflict per month. In
actuality, this average is misleading, since conflicts
tended to happen more often to users who worked with
the disconnected machines. A few users thus expe- 
rienced much higher conflict rates, while many users 
encountered considerably fewer conflicts than the av- 
erage. 

The conflicts were not evenly spread across all vol- 
umes in our environment. Table 2 shows the number 
of updates and conflicts for different types of volumes, 
and the conflict rates by volume for the nine-month
period. 

Volumes are classified as either shared or private, 
and either office, disconnected, or network. Shared 
volumes indicate volumes that receive heavy update 
traffic from multiple users, often to different volume 
replicas. A prominent example of this category of 
volume is the games volume. Updates to the games 
volume involve access to shared database files (game 
score files). Multiple users accessing different replicas 
and using the same application concurrently create a high
probability of conflicts. In addition, the games volume 
is a disconnected one, meaning that replicas exist both 
in the office and at users’ homes. Disconnection in- 
creases the likelihood of concurrent use, for the time 
period in which independent updates are deemed con- 
current is increased. Fortunately, the shared database 
files have relatively simple semantics, so it is easy to 
write automatic resolvers for these files. Other discon- 
nected, shared volumes included volumes of installed 
programs and libraries, which easily get in conflict if 
users are not careful about how they perform installa- 
tions. 

The source code volumes are another example of 
shared volumes, though these are stored entirely in the 
office. Although most source code files themselves are 
protected against multiple writers by a revision control 
service, conflicts can occur in two ways. First, multiple 
users can attempt to gain write permission on the same 
file via different replicas. Second, the same user can 
perform updates to two different replicas. The latter is 
not nearly as uncommon as it would seem, since source 
code volumes are replicated on server-style machines, 
which experience more down-time than normal work- 
stations due to increased load. Server crashes cause 
automatic replica switching, creating the potential for 
conflicts: a user updates one replica, then switches 
replicas and updates the second. 

Private volumes are users’ personal volumes. They
are almost always updated solely by the one user, and 
experience very few conflicts. However, when the 
private volumes are also disconnected, conflict rates 
rapidly increase. Examples of such volumes are the 
personal volumes of two Ficus project members who 
have replicas at the office and at home. Home for one
user is across town, and home for the other is across the
ocean in Guam. They both call the office by modem
and reconcile periodically, ranging from once a day to
once every few weeks.


Volume                  Number of Volumes  Number of   Number of  Conflict  User-Visible
Classification          in Class           Updates     Conflicts  Rate      Conflict Rate
Disconnected, Shared    9                  1,114,855   273        .0245%    .0052%
Disconnected, Private   16                 387,523     106        .0274%    .0012%
Office, Shared          -                  6,316,331   66         .0010%    .0005%
Office, Private         48                 6,286,754   44         .0007%    .0003%
Network, Shared         8                  36,778      0          0%        0%


Table 2: Update/update conflicts grouped by volume classification.

Most of the private volumes are rarely disconnected, 
and therefore one would expect there to be almost no 
conflicts, since only one person is performing updates 
and usually to the same replica. Private volumes stored 
only in the office accounted for only 9% of the total 
conflicts. 

Network volumes are volumes shared between sites 
connected by the Internet. These volumes are all 
shared, in our current environment. Many of them 
are test volumes, leading to a low number of updates 
for such volumes. 

As expected, disconnected volumes had a much 
higher rate of conflicts, 20 to 40 times as high as their 
office counterparts. Somewhat surprisingly, however,
disconnected shared volumes suffer a lower conflict
rate than disconnected private volumes.
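
The rates and the "20 to 40 times" comparison can be rechecked directly from the update and conflict counts in Table 2. The short script below only redoes that arithmetic; the names TABLE2 and rate_percent are introduced for the example.

```python
# (updates, conflicts) per volume class, copied from Table 2.
TABLE2 = {
    "Disconnected, Shared": (1_114_855, 273),
    "Disconnected, Private": (387_523, 106),
    "Office, Shared": (6_316_331, 66),
    "Office, Private": (6_286_754, 44),
    "Network, Shared": (36_778, 0),
}


def rate_percent(updates: int, conflicts: int) -> float:
    """Conflicts as a percentage of updates."""
    return 100.0 * conflicts / updates


rates = {name: rate_percent(u, c) for name, (u, c) in TABLE2.items()}
shared_ratio = rates["Disconnected, Shared"] / rates["Office, Shared"]
private_ratio = rates["Disconnected, Private"] / rates["Office, Private"]

for name, rate in rates.items():
    print(f"{name:22s} {rate:.4f}%")
print(f"disconnected/office ratio: shared {shared_ratio:.1f}, "
      f"private {private_ratio:.1f}")
```

The per-class counts also sum to the 489 update/update conflicts and 14,142,241 non-directory updates reported in Table 1.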

Table 2 also indicates which of these conflicts could 
have been resolved automatically. For example, all 
of the conflicts on the game volume were on simple 
database score files, and therefore easily resolved. The 
only conflicts that the user need see in the disconnected, 
shared class of volumes were those in the installation 
volumes. Most of the conflicts on source code files 
in the office shared class could not be resolved au- 
tomatically, however, because source code files have 
arbitrary semantics, and therefore require user inter- 
vention. Those conflicts that were resolvable were on 
object files (.o files). 

The unresolvable conflict rates for disconnected vol- 
umes are still significantly higher than the unresolvable 
conflict rates for office volumes, but the relative dif- 
ference is somewhat lower, particularly in the case of 
private volumes. Many of the files in disconnected 
private volumes that get into conflict are simple data- 
base files that can be resolved automatically. Shell 
histories are again a good example. In the discon- 
nected environment, they are likely to frequently enter 
conflict, while they rarely enter conflict in the office 
environment. But these conflicts are always automatically
resolvable. Once automatic resolution is taken
into account, instead of one disconnected private vol- 
ume update in five thousand requiring user attention, 
one in one hundred thousand requires it. This twenty- 
fold reduction in user intervention in conflicts on this 
class of volume is a powerful motivation for providing 
automatic resolution in the disconnected environment. 


6 Related Work 


The Ficus file system draws from several earlier sys- 
tems, and has some similarities to work done by others. 
This section discusses some of the related work, with 
particular attention to that concerning optimistic rep- 
lication, conflicts in optimistically replicated systems, 
and automatic resolution of such conflicts. 

Parker’s work on version vectors was an important 
early step in optimistic file replication [14]. It permit- 
ted reliable detection of independent updates to differ- 
ent replicas of a data item with limited and reasonable 
costs for maintaining the necessary information. 
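
The general idea behind version vectors can be sketched as follows. This is an illustration of the technique only, not the Locus or Ficus implementation; the replica names and the dictionary representation are assumptions made for the example.

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts: replica id -> update count).

    Returns "equal", "a_dominates", "b_dominates", or "conflict".
    A conflict means each replica has seen an update the other has not.
    """
    replicas = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(r, 0) > vv_b.get(r, 0) for r in replicas)
    b_ahead = any(vv_b.get(r, 0) > vv_a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "conflict"
    if a_ahead:
        return "a_dominates"
    if b_ahead:
        return "b_dominates"
    return "equal"


# Hypothetical replicas: both were updated independently after diverging.
office = {"office": 3, "home": 1}
home = {"office": 2, "home": 2}
print(compare(office, home))
```

The cost is one counter per replica per data item, which is why the information needed for detection stays small.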

Version vectors were used in the university Locus 
operating system [15, 17], a system that provided data 
replication and dealt with partitioned operation. How- 
ever, the Locus system never dealt substantively with 
the problems of conflicting updates. 

Sergio Faissol’s Ph.D. dissertation examined this 
question in the context of databases [2]. He inves- 
tigated several classes of information that could be 
stored in a database, how independent updates to those 
classes of information could be reconciled, and the 
information required to perform the necessary recon- 
ciliation. His work was primarily theoretical, and was 
never applied to file systems. 

The Coda project at Carnegie-Mellon University, 
discussed in detail in [18], is also developing an opti- 
mistically replicated file system. The Coda developers 
have considered the questions of disconnected opera- 
tions in a somewhat different context than the Ficus 
system. They support a highly connected backbone 
of server machines that replicate files. While these 
servers may occasionally fail or become disconnected
from each other, they are expected to be more reliable
than the typical single-user workstation machine.
Client machines cache replicas of files they actually 
use, and send the updates back to a server replica [10]. 

The nature of the Coda system makes partitioned 
first class replicas a less common event than in Ficus. 
Partitioned update is far more common between first 
and second class Coda replicas where simpler reconcil- 
iation algorithms are possible [10]. References [11, 12] 
discuss Coda’s log-based approach to conflict resolu- 
tion between first-class replicas. The design of conflict 
resolution in Coda is described in [13]. Like the Fi- 
cus approach, conflict resolvers are provided and are 
selected by file type. 

Unlike Ficus, the Coda approach uses files that hold 
resolution rules that apply to all files in a directory or 
its subdirectories. These rules are similar in form to 
rules in a Unix makefile. By placing a set of generic 
rules in the topmost directory, Coda can achieve the 
same effect as Ficus’ system resolver file. By using 
regular expressions that match only certain directory 
prefixes, the Ficus resolver files can achieve the same 
effect as Coda’s per-directory rules files. Unlike the 
Ficus approach, Coda does not automatically serially 
apply different resolvers to a file in conflict, though 
presumably the makefile rules could be set up in such 
a way that they could. Generally speaking, the expressive
power of the Coda and Ficus approaches seems
similar. More experience with both systems is needed
to determine if either approach has a clear advantage in
user friendliness. The statistics presented in this paper
provide a first step toward addressing some of these issues.

Huston and Honeyman describe their approach to 
optimistic replication in disconnected AFS in [9]. This 
system permits updates to cached copies of data at 
disconnected client sites under AFS. Writes generated 
by a disconnected client site are logged and replayed 
when the client is reconnected to a server. If any of the 
logged write operations conflict with writes performed 
by some other client during the disconnection, the con- 
flicts are detected and reported. No attempt is made to 
automatically resolve them, though Huston and Hon- 
eyman do briefly discuss plans to provide tools to help 
users resolve common types of conflicts. 

Howard has developed an optimistic reconciliation- 
based system to permit occasionally connected ma- 
chines to share files [8]. He reliably detects conflicts 
using a journalling mechanism, but currently makes no 
attempt to reconcile them. 


7 Observations and Conclusions 


Optimistic file replication in an environment that has 
any serious degree of disconnection benefits from automatic
conflict resolution; it can substantially reduce
the conflict rate observed by users. We present data for
two environments, a usually-connected office environment
and a periodically connected, usually-disconnected
home use environment.

In the office environment, without automatic con- 
flict resolution, the typical user would need to resolve 
around two conflicts per month, considering both up- 
date/update and name conflicts. With automatic resolu- 
tion, the frequency of conflicts requiring user attention 
would drop to one and a half or less. The resolvers 
Ficus currently has installed and that will be added to 
our suite can reduce the total number of user-visible 
conflicts by about one half. 

The effects are more dramatic in the home use en- 
vironment. In this environment, two users generated 
380 conflicts in 9 months, averaging nearly a conflict a 
day for each user. In actuality, one of the two users ex- 
perienced the bulk of the conflicts. He made extensive 
use of disconnected home computing, reconciling his 
volumes only once a day or so, so his conflict rate was 
significantly higher. He observed 30 to 40 conflicts per 
month. Applying automatic resolvers to the home use 
environment reduces the observed conflict rate for this 
user to around seven conflicts per month. 

The in-office statistics might suggest that the added 
value of automatic resolution of some conflicts is not 
that great, in that environment. However, there are 
some additional points to consider. First, as pointed 
out in Section 5, we did not gather statistics for the 
value of the most important case of all, the automatic, 
built-in directory resolver. (We hope to gather these 
statistics in the future.) Second, many of the conflicts 
that are automatically resolved are easily handled using 
a program, but hard to resolve by hand. If they were 
not automatically resolved, they would require a user 
to invoke a tool that might equally well be invoked 
automatically. Directories and binary data are exam- 
ples. Third, further effort applied to writing resolvers 
certainly would decrease the observed rate of conflicts 
even more. 

The case for automatic conflict resolution in less
connected environments is even stronger. Environments
in which disconnection is even more common
than our home use environment, such as mobile computing,
can be expected to have higher conflict rates.
Our data suggests that conflicts in this environment
are often easier to reconcile than those in the office
environment. A sixfold decrease in the observed conflict
rate for a replicated home use environment is a
major improvement.

Many files have semantics allowing fairly simple 
resolution of all conflicts. Even when not all possible
conflicts a file can experience are automatically resolvable,
there are often large classes of conflicts that can
be fixed without human attention. Unix-style directo- 
ries are one such example, where all conflicts except 
name conflicts can be automatically resolved. In sev- 
eral other cases, we have discovered that solutions that 
solve 80% or so of all possible problems work very 
well. The user need only be informed in the case of 
the 20% that cannot be resolved. In some cases, the 
resolution of the difficult set of conflicts can even be 
guessed at, with the user only becoming aware of the 
difficulty if the guess is wrong. .newsrc and conflict 
log files are two such cases. 


Implementing data storage as directories offers an 
opportunity to leverage the Ficus directory resolution 
algorithms. When data follows insert/delete semantics 
(such as Ficus graft points do) this mapping is quite 
natural. In the future, we plan to restructure the direc- 
tory reconciliation algorithms as a library that can be 
used in more general situations. 


Reconciliation chooses resolvers first by file name, 
applying consecutive resolvers until one succeeds. 
Storing the file type as an attribute of the file would be a 
more attractive approach. Existing Unix file attributes 
leave little room for such information, but an object- 
oriented file system with a general purpose attribute ser- 
vice could store a resolver list as an attribute. The rec- 
onciliation process could then directly call the proper 
resolvers for each conflict. Such an object-oriented file 
system is under development in our project, and will be 
tested with resolver attributes. Until all data is stored 
as typed objects, the approach discussed in this paper 
offers an attractive interim solution. 


While this work has been applied to a Unix-style file 
system, most of it is not specific to Unix systems. The 
general approach is applicable to many other systems 
and could be simplified on systems that don’t allow 
multiple names for the same file. The approach of pair- 
wise resolution of single conflicting files, name-based 
choice of resolvers, and iteratively invoking conflict 
resolvers until one of them succeeds appears to be gen- 
erally applicable. 
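
The approach summarized above (name-based choice of resolvers, tried in turn until one succeeds) can be sketched as follows. This is an illustration under stated assumptions, not the Ficus resolver interface: the resolver functions, the file name patterns, and the convention that a resolver returns None on failure are all hypothetical.

```python
import re


def take_if_identical(a, b):
    """Hypothetical resolver: succeeds only when both versions match."""
    return a if a == b else None


def merge_score_lines(a, b):
    """Hypothetical resolver: union of the lines in two score files."""
    return "\n".join(sorted(set(a.splitlines()) | set(b.splitlines())))


# Ordered (pattern, resolvers) table; names and patterns are made up.
RESOLVERS = [
    (re.compile(r".*\.scores"), [take_if_identical, merge_score_lines]),
    (re.compile(r".*"), [take_if_identical]),
]


def resolve(name, version_a, version_b):
    """Return merged contents, or None if the user must be notified."""
    for pattern, candidates in RESOLVERS:
        if not pattern.fullmatch(name):
            continue
        # Apply consecutive resolvers until one of them succeeds.
        for resolver in candidates:
            merged = resolver(version_a, version_b)
            if merged is not None:
                return merged
    return None


print(resolve("game.scores", "alice 10\n", "bob 7\n"))
print(resolve("main.c", "int x;", "int y;"))
```

A source file with two different versions falls through every resolver and is reported to the user, while the score file is merged without notification.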


In conclusion, our experience with conflicts in opti- 
mistically replicated file systems is that, for one com- 
mon environment, conflicts are rare. At least two thirds 
of those conflicts that do occur can be resolved automat- 
ically, with no user intervention or even notification. 
Further effort in building more resolvers would reduce
the rate of user notification of conflicts even further. Our
experience with working on the Ficus system is that the 
typical user is not bothered by either the possibility of 
conflicts or their actual occurrence.


Abstract


Despite impressive advances in file system throughput 
resulting from technologies such as high-bandwidth 
networks and disk arrays, file system latency has not 
improved and in many cases has become worse. Con- 
sequently, file system I/O remains one of the major 
bottlenecks to operating system performance [10]. 

This paper investigates an automated predictive 
approach towards reducing file latency. Automatic 
Prefetching uses past file accesses to predict future 
file system requests. The objective is to provide data in 
advance of the request for the data, effectively masking 
access latencies. We have designed and implemented a
system to measure the performance benefits of automatic
prefetching. Our current results, obtained from
a trace-driven simulation, show that prefetching results
in as much as a 280% improvement over LRU, especially
for smaller caches. Alternatively, prefetching
can reduce cache size by up to 50%.


1 Motivation 


Rapid improvements in processor and memory speeds 
have created a situation in which I/O, in particular file 
system I/O, has become the major bottleneck to operat- 
ing system performance [10]. Recent advances in high 
bandwidth devices (e.g., RAID, ATM networks) have 
had a large impact on file system throughput. Unfor- 
tunately, access latency still remains a problem and is 
not likely to improve significantly due to the physical 
limitations of storage devices and network transfer la- 
tencies. Moreover, the increasing popularity of certain 
file system designs such as RAID, CDROM, wide area 
distributed file systems, wireless networks, and mo- 
bile hosts has only exacerbated the latency problem. 
For example, distributed file systems experience network
latency combined with standard disk latency. As
distributed file systems scale both numerically and geographically,
as envisioned by the Andrew File System
designers [7], network delays will become the dom- 
inant factor in remote file system access. Similarly, 
local file systems built on technologies like CD-ROMs 
also suffer from very high latencies but continue to in- 
crease in popularity due to the large amount of storage 
space they offer. 

Although a variety of high bandwidth technologies 
are now available, it is unlikely that existing (and 
emerging) low-end technologies such as serial lines 
running SLIP or PPP, 64/128 Kb ISDN and other slower 
speed networks will disappear in the near future given 
their low cost and widespread use. Such communication
technologies suffer from both high latencies and
low bandwidths. Distributed file systems that build on
or incorporate these technologies will experience latencies
substantially higher than those of conventional
file systems. However, the appeal of low-cost, widely
available shared access to files will certainly prolong 
the existence of such file systems, despite their poor 
performance. 

The goal of our research is to investigate methods
for successfully reducing the perceived latency associated
with file system operations. In this paper, we
describe a new method for masking file system latency 
called automatic prefetching. Automatic prefetching 
takes a heuristic-based approach using knowledge of 
past accesses to predict future access without user or 
application intervention. As a result, applications au- 
tomatically receive reduced perceived latencies, better 
use of available bandwidth via batched file system re- 
quests, and improved cache utilization. 


2 Related work 


Both caching and prefetching have been used in a vari- 
ety of settings to improve performance. The following 
briefly describes related work involving caching and 
prefetching to improve file system performance.


2.1 Caching


Caching has been used successfully in many systems 
to substantially reduce the amount of file system I/O 
[16, 6, 8, 1]. Despite the success of caching, it is pre- 
cisely the accesses that cannot be satisfied from the 
cache that are the current bottleneck to file system per- 
formance [10]. Unfortunately, increasing the cache 
size beyond a certain point only results in minor per- 
formance improvements. Experience shows that the 
relative benefit of caching decreases as cache size (and 
thus cache cost) increases [9, 8]. There exists a thresh- 
old beyond which performance improvements are mi- 
nor and prohibitively expensive. Moreover, studies 
show that the “natural” cache size or threshold is becoming
a substantially larger fraction (one fourth to one
third) of the total memory, due in part to larger files
(e.g., big applications, databases, video, audio, etc.) 
[2]. Consequently, new methods are needed to reduce 
the perceived latency of file accesses and keep cache 
sizes in check. 

Although machines with large memories are now 
available, low-end workstations, PCs, mobile lap- 
tops/notebooks, and now PDAs (personal data assis- 
tants) with limited memory capacities enjoy wide- 
spread use. Because of cost or space constraints these 
machines cannot support large file caches. The desire 
for smaller portable machines combined with continually
increasing file sizes means that large caches cannot
be assumed to be the complete solution to the latency 
problem. 

Finally, as a result of rapid improvements in band- 
width, cache miss service times are dominated by la- 
tency. Note that: 


• Most files are quite small. In fact, measurements
of existing distributed file systems show that the
average file is only a few kilobytes long [9, 2].
For files of this size, transmission rate is of little
concern when compared to the access latency
across a WAN or from a slow device. As a result,
access latency, not bandwidth, becomes the dominant
cost for references to files not in the cache.


• In many distributed file systems, the open() and
close() functions represent synchronization points 
for shared files. Although the file itself may reside 
in the client cache, each open() and close() call 
must be executed at the server for consistency 
reasons. The latency of these calls can be quite 
large, and tends to dominate other costs, even 
when the file is in the file cache. 
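
The first point can be illustrated with a small calculation. The file size, link speeds, and access latency below are assumed values chosen only for illustration; they are not measurements from this paper.

```python
def fetch_time_ms(file_bytes, bandwidth_bits_per_s, latency_ms):
    """Time to fetch a file: fixed access latency plus transfer time."""
    transfer_ms = file_bytes * 8 / bandwidth_bits_per_s * 1000
    return latency_ms + transfer_ms


# Assumed values for illustration only (not measurements from the paper).
FILE_BYTES = 4 * 1024   # "a few kilobytes"
LATENCY_MS = 50.0       # assumed network round trip plus device access

slow = fetch_time_ms(FILE_BYTES, 10_000_000, LATENCY_MS)    # 10 Mb/s link
fast = fetch_time_ms(FILE_BYTES, 100_000_000, LATENCY_MS)   # 100 Mb/s link
print(f"10 Mb/s:  {slow:.2f} ms")
print(f"100 Mb/s: {fast:.2f} ms")
```

Under these assumptions, a tenfold increase in bandwidth shortens the fetch by only a few milliseconds, because the fixed latency term dominates for small files.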


In short, the benefits of standard caching have been
realized. To improve file system performance further
and keep file cache sizes in check, caching will need to
be supplemented with new methods and algorithms.


2.2 Prefetching 


The concept of prefetching has been used in a va- 
riety of environments including microprocessor de- 
signs, virtual memory paging, databases, and file read 
ahead. More recently, long term prefetching has been 
used in file systems to support disconnected operation 
[14, 15, 5]. Prefetching has also been used to improve 
parallel file access on MIMD architectures [4]. 

One relatively straightforward method of prefetching
is to have each application inform the operating
system of its future requirements. This approach has
been proposed by Patterson et al. [11]. Using this approach,
the application program informs the operating
system of its future file requirements, and the operating 
system then attempts to optimize those accesses. The 
basic idea is that the application knows what files will 
be needed and when they will be needed. 

Application directed prefetching is certainly a step 
in the right direction. However, there are several draw- 
backs to this approach. Using this approach, applica- 
tions must be rewritten to inform the operating system 
of future file requirements. Moreover, the program- 
mer must learn a reasonably complex set of additional 
system directives that must be strategically deployed 
throughout the program. This implies that the appli- 
cation writer must have a thorough understanding of 
the application and its file access patterns. Ironically, a 
key goal of many recent languages, in particular object-oriented
languages, is abstraction and encapsulation, that is,
hiding the implementation details from the programmer.
Even when the details are visible, our experience
indicates that the sheer size and complexity of many
software systems creates a situation in which experts 
may have difficulty grasping the complete picture of 
file access patterns. Moreover, incorrectly placed di- 
rectives or an incomplete set of directives can actually 
degrade performance rather than improve it. 

A second problem is that the operating system needs
a significant lead time to ensure the file is available
when needed. Therefore, in order to benefit from
prefetching, the application must have a significant 
amount of computation to do between the time the file 
is predicted and the time the file is accessed. However, 
many applications do not know which files they will 
need until the actual need arises. For instance, the pre- 
processor of a compiler does not know the pattern of 
nested include files until the files are actually encoun- 
tered in the input stream, nor will an editor necessarily 
know which files a user normally edits. Our approach 
attempts to solve this problem by predicting the need
for a file well in advance of when the application could;
in some cases long before the application even begins 
to execute. 

A third problem with application driven prefetching 
arises in situations where related file accesses span mul- 
tiple executables. Typically applications are written in- 
dependently and only know file access patterns within 
the application. In situations where a series of applica- 
tions execute repeatedly, like an edit/compile/run cycle, 
or certain commonly run shell scripts, no one applica- 
tion knows the cross-application file access patterns, 
and therefore cannot inform the operating system of a 
future application’s file requirements. In some cases, 
batch-type utilities, such as the Unix make facility, can 
be instrumented to understand cross-application access 
patterns. However, even in this case, a complete view
of the real cross-application pattern is often unknown to
the user, or determining it requires considerable expertise.
Our approach uses long term history information
to support prefetching across application boundaries.


3 Automatic Prefetching 


We are investigating an approach we call automatic 
prefetching, in which the operating system rather 
than the application predicts future file requirements. 
The basic idea and hypothesis underlying automatic 
prefetching is that future file activity can be success- 
fully predicted from past file activity. This knowledge 
can then be used to improve overall file system perfor- 
mance. 

Automatic prefetching has several advantages over 
existing approaches. First, existing applications do not 
need to be rewritten or modified, nor do new appli- 
cations need to incorporate non-portable prefetching 
operations. As a result, all applications receive the 
benefits of automatic prefetching, including existing 
software. Second, because the operating system au- 
tomatically performs prefetching on the application’s 
behalf, application writers can concentrate on solving 
the problem at hand rather than worrying about opti- 
mizing file system performance. Third, the operating 
system monitors file access across application bound- 
aries and can thus detect access patterns that span mul- 
tiple applications executed repeatedly. Consequently, 
the operating system can prefetch files substantially 
earlier than the file is actually needed, often before the 
application even begins to execute. 

Automatic prefetching allows the operating system 
effectively to overlap processing with file transfers. 
The operating system can also use past access infor- 
mation to batch together multiple file requests and thus 
make better use of available bandwidth. Past access information
can also be used to improve the cache management
algorithm, effectively reducing cache misses
even if no prefetching occurs. 

The first goal of our research was to determine 
whether such an approach is viable. Our second goal 
was to develop effective prefetch policies and quantify 
the benefits of automatic prefetching. The following 
sections consider each of these objectives and describe 
our results. 


4 Analysis of Existing Systems 


To determine the viability of automatic prefetching, we 
analyzed current file system usage patterns. Although 
other researchers have gathered file system traces [9, 2], 
we decided to modify the SunOS kernel in order to 
gather our own traces that extract specific information 
important to our research. In addition to recording all 
file system calls made by the system, the kernel gathers 
precise information regarding the issuing process and 
the timing for every operation. The timing information 
not only serves as an indicator of the system’s perfor- 
mance, but it also provides information as to whether 
prefetching can have any substantial effects on perfor- 
mance. 

We gathered a variety of traces, including the normal 
daily usage of several researchers, and also various 
synthetic workloads. Traces were collected on a single 
Sun Sparcstation supporting several users executing a 
variety of tasks. Traces were collected for varying time 
periods with the longest traces spanning more than 10 
days and containing over 500,000 operations. Users 
were not restricted in any way. Typical daily usage 
included users processing email, editing, compiling, 
preparing documents, and executing other tasks typical
of an academic environment. This particular set of 
traces contains almost no database activity. The data 
we collected appears to be in line with that of other 
studies [9, 2] given similar workloads. 

Our initial analysis of the trace data indicates that 
typical file system usage can realize substantial per- 
formance improvements from the use of prefetching, 
and also provides several guidelines for a successful 
prefetching policy. 

First, the data shows that there is relatively little time 
between the moment when a file is opened and the 
moment when the first read occurs (see figure 1). In 
fact, the median time for our traces was less than three 
milliseconds. Consequently, prefetching must occur 
significantly earlier than the open operation to achieve 
any significant performance improvement. Prefetching 
at open time will only provide minor improvements. 

Second, the data shows that the average amount
of time between successive opens is substantial (200
ms). If the operating system can accurately predict the
next file that will be accessed, there exists a sufficient
amount of time to prefetch the file.

Figure 1: Histogram of times between open and first read of a file.
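
The two measurements discussed here (time from open to first read, and time between successive opens) could be derived from a timestamped trace along the following lines. This is a sketch only: the record layout is an assumption, and the sample records are invented for the example rather than taken from our traces.

```python
from statistics import mean, median

# Hypothetical trace records: (timestamp in ms, operation, file name).
# These values are invented for illustration.
trace = [
    (0.0, "open", "a.c"),
    (2.0, "read", "a.c"),
    (210.0, "open", "a.h"),
    (211.5, "read", "a.h"),
    (400.0, "open", "b.h"),
    (403.0, "read", "b.h"),
]


def open_to_first_read(records):
    """Delay between each open and the first later read of that file."""
    pending, delays = {}, []
    for ts, op, name in records:
        if op == "open":
            pending[name] = ts
        elif op == "read" and name in pending:
            delays.append(ts - pending.pop(name))
    return delays


def inter_open_gaps(records):
    """Time between successive open operations."""
    opens = [ts for ts, op, _ in records if op == "open"]
    return [later - earlier for earlier, later in zip(opens, opens[1:])]


print(median(open_to_first_read(trace)))
print(mean(inter_open_gaps(trace)))
```

A short open-to-read delay combined with a long gap between opens is what makes prefetching before the open worthwhile.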

In a multi-user, multiprogramming environment, 
concurrently executing tasks may generate an inter- 
leaved stream of file requests. In such an environment, 
reliable access patterns may be difficult to obtain. Even 
when patterns are discernable, the randomness of the 
concurrency may render the prefetching effort inef- 
fective. However, analysis of trace data consisting of 
multiple users (and various daemons) shows that even 
in a multiprogramming environment accesses tend to 
be “sequential” where we define sequential as a sen- 
sible/predictable uninterrupted progression of file ac- 
cesses associated with a task. In fact, measurements 
show that over 94% of the accesses follow logically 
from the previous access. Thus multiprogramming 
seems to have little effect on the ability to predict the 
next file referenced. 


5 The Probability Graph 


We have designed and implemented a simple analyzer 
that attempts to predict future accesses based on past 
access patterns. Driven by trace data, the analyzer 
dynamically creates a logical graph called a Probability 
Graph. Each node in the graph represents a file in the 
file system. 

Before describing the probability graph, we must define
the lookahead period used to construct the graph.
The lookahead period defines what it means for one file 
to be opened “soon” after another file. The analyzer 
defines the lookahead period to be a fixed number of 
file open operations that occur after the current open. 
If a file is opened during this period, the open is consid- 
ered to have occurred “soon” after the current open. A 
physical time measure rather than a virtual time mea- 
sure could be used, but the above measure is easily 
obtained and can be argued to be a better definition 
of “soon” given the unknown execution times and file 
access patterns of applications. Our results show that 
this measure works well in practice. 


We say two files are related if the files are opened 
within a lookahead period of one another. For example, 
if the lookahead period is one, then the next file opened 
is the only file considered to be related to the current 
file. If the lookahead period is five, then any file opened 
within five files of the current file is considered to be 
related to the current file. 


The analyzer allocates a node in the probability 
graph for each file of interest in the file system. Unix 
exec system calls are treated like opens and thus are 
included in the probability graph. One graph, derived 
from the trace described in section 7, generated ap- 
proximately 6,500 nodes accessed over an eight day 
period. Each node consumes less than one hundred 
bytes, and can be efficiently stored on disk in the inode 
of each associated file, with active portions cached for
better performance. Our current graph storage scheme
has not been optimized and thus is rather wasteful. We 
have recently begun investigating methods that will 
substantially reduce the graph size via graph pruning, 
aging, and/or compression. 

Arcs in the probability graph represent related accesses.
If the open for a second file follows within the
lookahead period of the open for a first file, a directed
arc is drawn from the first to the second. Larger
lookaheads produce more arcs. The analyzer weights
each arc by the number of times that the second file is
accessed after the first file. Thus, the graph represents
an ordered list of files demanded from the file system, 
and each arc represents the probability of a particular 
file being opened soon after another file. 
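
The construction described above can be sketched as follows. This is not the analyzer's code; it assumes the trace has already been reduced to an ordered list of opened file names, and it counts every open inside the lookahead window, including repeated opens of the same file. The file names are hypothetical.

```python
from collections import defaultdict


def build_graph(opens, lookahead):
    """Build arc weights from an ordered list of opened file names.

    graph[a][b] is the number of times b was opened within `lookahead`
    opens after an open of a, i.e. the weight of the arc a -> b.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, current in enumerate(opens):
        for follower in opens[i + 1 : i + 1 + lookahead]:
            graph[current][follower] += 1
    return graph


def arc_count(graph):
    """Number of distinct arcs in the graph."""
    return sum(len(arcs) for arcs in graph.values())


# Hypothetical sequence of opens (exec calls are treated like opens).
opens = ["cc", "a.c", "a.h", "cc", "a.c", "b.h"]
graph = build_graph(opens, lookahead=1)
print({a: dict(arcs) for a, arcs in graph.items()})
print(arc_count(graph), arc_count(build_graph(opens, lookahead=2)))
```

With a lookahead of one, the arc from cc to a.c has weight two because a.c followed cc twice; raising the lookahead to two adds arcs such as cc to a.h.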

Figure 2 illustrates the structure of an example probability
graph. The probability graph provides the information
necessary to make intelligent prefetch decisions.
We define the chance of a prediction being
correct as the probability of a file (say file B) being
opened given the fact that another file (file A) has been
opened. The chance of file B following file A can be
obtained from the probability graph as the ratio of the
number of arcs from file A to file B divided by the total
number of arcs leaving file A. We say a prediction is
reasonable if the estimated chance of the prediction is
above a tunable parameter minimum chance. We say
a prediction is correct if the file predicted is actually
opened within the lookahead period.

Figure 2: Three nodes of an example probability graph.
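
The definitions of chance and of a reasonable prediction can be written down directly. The sketch below assumes arc weights are stored as nested dictionaries of counts; the file names and weights are hypothetical and are not taken from Figure 2.

```python
# Hypothetical arc weights: graph[a][b] = times b was opened soon after a.
graph = {
    "cc": {"a.c": 8, "b.c": 2},
    "a.c": {"a.h": 9, "b.h": 1},
}


def chance(graph, a, b):
    """Estimated probability that b is opened soon after a."""
    arcs = graph.get(a, {})
    total = sum(arcs.values())
    return arcs.get(b, 0) / total if total else 0.0


def reasonable_predictions(graph, a, min_chance):
    """Files whose estimated chance of following a exceeds min_chance."""
    return [b for b in graph.get(a, {}) if chance(graph, a, b) > min_chance]


print(chance(graph, "cc", "a.c"))
print(reasonable_predictions(graph, "cc", 0.65))
print(reasonable_predictions(graph, "cc", 0.95))
```

Raising the minimum chance from 65% to 95% removes the only candidate for cc in this example, so no prefetch would be issued.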

Establishing a minimum chance requirement is crucial
to avoid wasting system resources. In the absence
of a minimum requirement, the analyzer would produce
several predictions for each file open, consuming net- 
work and cache resources with each prediction, many 
of which would be incorrect.

To measure the success of the analyzer we define an
accuracy value. The accuracy of a set of predictions is 
the number of correct predictions divided by the total 
number of predictions made. The accuracy will almost 
always be at least as large as the minimum chance, and 
in practice is substantially higher. 
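
The accuracy measure can be sketched as follows, under the assumption that each prediction is recorded as the position of the open that triggered it together with the predicted file. The sample data is hypothetical.

```python
def accuracy(opens, predictions, lookahead):
    """Fraction of predictions that were correct.

    `predictions` is a list of (index, file) pairs: at the open with
    position `index` in `opens`, `file` was predicted. A prediction is
    correct if `file` is opened within the next `lookahead` opens.
    """
    if not predictions:
        return 0.0
    correct = sum(
        1
        for index, predicted in predictions
        if predicted in opens[index + 1 : index + 1 + lookahead]
    )
    return correct / len(predictions)


# Hypothetical trace and predictions, for illustration only.
opens = ["cc", "a.c", "a.h", "cc", "a.c", "b.h"]
predictions = [(0, "a.c"), (1, "a.h"), (3, "a.c"), (4, "a.h")]
print(accuracy(opens, predictions, lookahead=1))
```

Three of the four predictions are followed by the predicted file, so the accuracy for this example is 0.75.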

The number of predictions made per open call varies 
with the required accuracy of the predictions. Re- 
quiring very accurate predictions (predictions that are 
almost never wrong) means that only a limited number 
of predictions can be made. For one set of trace data, 
using arelatively low minimum chance value (65%) the 
predictor averaged 0.45 files predicted per open. For 
higher minimum chance values (95%) the predictor av- 
eraged only 0.1 files predicted per open. Even when 
using a relatively low minimum chance (e.g., 65%), the 
predictor was able to make a prediction about 40% of 
the time and was correct on approximately 80% of the 
predictions made. 

Figure 3 shows the distribution of estimated chance 
values with a lookahead of one. The distribution shows 
that a large number of predictions have an estimated 
chance of 100%. Setting the minimum chance less 
than 50% places the system in danger of prefetching 
many unlikely files. By setting the minimum chance at 
50%, very few files that should have been prefetched 
will be missed. Moreover, the distribution shows how 
a low minimum chance can still result in a high average 
accuracy. 


6 A Simulation System 


To evaluate the performance of systems based on au- 
tomatic prefetching, we implemented a simulator that 
models a file system. In order to simulate a variety 
of file system architectures having a variety of perfor- 
mance characteristics, the simulator is highly parame- 
terized and can be adjusted to model several file system 
designs. This flexibility allows us to measure and com- 
pare the performance of various cache management 
policies and mechanisms under a wide variety of file 
system conditions. The simulator consists of four basic 
components: a driver, cache manager, disk subsystem, 
and predictor. 

The driver reads a timestamped file system trace and 
translates each file access into a file system request for 
the simulator to process. Because the driver generates 
file requests directly from the trace data, the workload 
is exactly like that of typical (concurrent) user-level 
applications. However, the driver must modify the 
set of requests in a few special cases. Because the 
simulator is only interested in file system I/O activity, 
the driver removes accesses made to files representing 
devices such as terminals or /dev/null. References to 
Figure 3: Histogram of estimated chances given a lookahead of one. 


certain standard shared libraries such as the C library 
are also eliminated. Accesses (e.g., mmap() calls) to 
these libraries rarely require any file system activity, 
since they are typically already present in the virtual 
memory cache. 


The cache manager manages a simulated file cache 
and services as many requests as possible from the 
cache without invoking the disk subsystem. We have 
implemented two cache managers. The first is a stan- 
dard LRU cache manager, where disk pages are re- 
placed in the order of least recent use. The second 
cache manager is the prefetch cache manager. The 
prefetch cache manager operates much like the LRU 
manager, updating timestamps on each access and re- 
placing the least recently used page. However, the 
prefetch manager also updates timestamps based on 
knowledge of expected accesses from the predictor, 
thus rescuing some soon-to-be-accessed pages from 
replacement. We have found that prefetch cache man- 
agement can improve performance even if no prefetch- 
ing occurs (i.e., no pages are actually brought in ahead 
of time). When run in prefetch mode, the simulator 
shows that anywhere between 5% and 30% of the per- 
formance improvement comes from pages that were 
rescued rather than actually being prefetched. 
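
The rescue behaviour can be sketched in Python as follows. This is an illustration under our own assumptions rather than the simulator's code: the class and method names are hypothetical, and recency order in an `OrderedDict` stands in for explicit timestamps.

```python
from collections import OrderedDict


class PrefetchCache:
    """LRU page cache whose recency can also be refreshed by predictions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # ordered from least to most recently used

    def access(self, page):
        """Demand access. Returns True on a cache hit, False on a miss."""
        hit = page in self.pages
        self._touch_or_load(page)
        return hit

    def predict(self, page, prefetch=True):
        """Expected access reported by the predictor."""
        if page in self.pages:
            self.pages.move_to_end(page)  # rescue: refresh the page's recency
        elif prefetch:
            self._touch_or_load(page)  # bring the page in ahead of time

    def _touch_or_load(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict the least recently used page
            self.pages[page] = True
```

Calling `predict(page, prefetch=False)` models the case in which no pages are brought in ahead of time: a predicted page that is already cached is still moved away from the replacement end of the list.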


The task of the disk subsystem is to simulate a file 
storage device. The current disk subsystem has been 
configured to emulate local disks. Local disks have relatively 
low latency when compared to our other target 


file systems (e.g., wide area distributed file systems, 
CDROMs, RAIDs, or wireless networks). Conse- 
quently, we expect that the performance improvements 
realized with a local disk model will only be amplified 
in our other target environments. In the following tests, 
we assumed a disk model with a first access latency of 
15 ms and a transfer rate of 2 MB/sec after factoring in 
typical file system overhead. 
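
Under this disk model, the time to service a request can be estimated with a one-line formula. The sketch below assumes decimal megabytes and a single contiguous transfer per request; the constant and function names are ours.

```python
FIRST_ACCESS_LATENCY_S = 0.015        # 15 ms first access latency (from the text)
TRANSFER_RATE_BYTES_PER_S = 2 * 10**6  # 2 MB/sec, decimal megabytes assumed


def service_time(nbytes):
    """Estimated seconds to read `nbytes` from the simulated local disk."""
    return FIRST_ACCESS_LATENCY_S + nbytes / TRANSFER_RATE_BYTES_PER_S
```

Under these assumptions an 8192-byte read costs roughly 19 ms, most of which is the first access latency.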

Finally, the simulator contains a predictor. The 
predictor observes open requests that arrive from the 
driver, and records the data in the probability graph 
described earlier. The predictor builds the probability 
graph dynamically just as it would be done in a real 
system. The longer the simulator executes, the wiser it 
becomes. On each access the simulator gains a clearer 
understanding of the true access patterns. 

During each open, the probability graph is examined 
for prefetch opportunities. If an opportunity is discovered, 
then a read request is sent to the cache manager. If 
the cache contains the appropriate data, then the data’s 
access time is set to the current time. This ensures 
that the data will be present for the anticipated need, 
and possibly rescues the data from an impending flush 
from the cache. If the prefetch request cannot be satis- 
fied from the cache, then it is prefetched from the disk 
subject to the characteristics of the disk subsystem. 

Notice that the current disk subsystem does no re- 
ordering of requests. In particular, it does not preempt 
or defer prefetch requests to satisfy subsequent appli- 
Figure 4: Histogram of times between prefetch and first read access. 


cation requests. Reordering and prioritizing requests 
represents an area of further potential performance im- 
provements. 


We are currently in the process of implementing the 
automatic prefetching system inside a Unix kernel run- 
ning NFS to measure performance on an actual system. 


7 Experimental Results 


We performed several tests to measure the performance 
improvements achieved by automatic prefetching. For 
the particular set of tests described below, a trace taken 
over an eight day period containing the unrestricted 
activity of multiple users was used. To determine the 
performance benefits of prefetching, we ran several 
simulations varying the cache size, lookahead value, 
and minimum chance and also measured the LRU per- 
formance in each case for comparison purposes. 


Recall from Section 4 that the time between the open 
of a file and the first read is too small for prefetching to 
be effective. Figure 4 shows that the simulator is able 
to predict and begin prefetching files sufficiently far in 
advance of the first read to the file. Our measurements 
indicate that 94% of the files that were predicted and 
then subsequently accessed were prefetched more than 
20 ms before the actual need, resulting in cache hits at 
the time of the first read. 


7.1 Prefetch Parameters' Effect on Performance 


Two parameters that significantly affect the predictions 
made by the predictor are the lookahead and minimum 


chance values. 


Recall that the lookahead represents how close two 
file opens need be for the files to be considered related. 
Setting this value very large increases the number of 
files that are considered related to each other, and there- 
fore each file open may potentially cause several other 
files to be prefetched. 

Large lookaheads increase the number of files 
prefetched since more predictions are made in response 
to each open request. Moreover, large lookaheads re- 
sult in files being prefetched substantially earlier, be- 
cause predictions can be made much further in ad- 
vance. As a result, large lookaheads are inappropriate 
for smaller cache sizes, but often perform very well 
with larger caches¹. In the case of small caches, large 
lookaheads tend to prefetch files too far in advance of 
the need. As a result, data necessary to the current computation 
may be forced out of the cache and replaced 


¹Here we use the terms “small” and “large” as relative measures 
of cache size, where the meaning of “small” and “large” depends on 
the workload. A “small cache” will have many cache misses while 
a “large cache” will have few misses. For the workload in this trace, 
caches of one megabyte or less would be considered small while 
caches of three megabytes or more would be considered large. Other 
traces would produce different values. 
Figure 5: Cache misses as a function of lookahead and MinChance for a 400K cache. Performance varies by as much 
as 13% (between 43% and 56%) depending on the lookahead and MinChance settings. 
Figure 6: Cache misses as a function of lookahead and MinChance for a 4M cache. Performance varies by as much as 
2% (between 9% and 11%) depending on the lookahead and MinChance settings. 
Table 2: Data points corresponding to Figure 6. 


by (useless) data needed far in the future. However, 
for larger cache sizes, the cache may have sufficient 
space to load in file data required in the future without 
disturbing the file data required by the current compu- 
tation. 


MinChance is the minimum estimated probability 
that a given file will be needed in the near future. 
For larger cache sizes smaller MinChance values per- 
form better. Setting the MinChance low results in 
aggressive prefetching. When the cache is large, incorrect 
prefetches have minimal effect on overall performance. 
Somewhat surprisingly, an aggressively low 
MinChance value benefits small caches as well. Be- 
cause the hit rate is low for small caches, correct pre- 
dictions result in large performance benefits. A low 
minimum chance increases the total number of cor- 
rect predictions. For moderate cache sizes, the optimal 
MinChance is a function of the specific cache size and 
must limit the number of missed prefetch opportunities 
without prefetching unnecessary files. 


In summary, MinChance should be low (aggressive) 
for both large and small caches, but higher for inter- 
mediate size caches. Lookahead should increase with 
increasing cache size. Figures 5 and 6 and their associated 
tables, Tables 1 and 2, illustrate these tradeoffs 
for a 400 KB cache and a 4000 KB cache respectively. 
Clearly, the Lookahead and MinChance parameters are 
highly sensitive to the cache size and must be adjusted 
in accordance with the cache size. Moreover, mul- 
tiple settings for a particular cache size may result in 
approximately equal miss ratios. In this case, other fac- 
tors such as network congestion and processing over- 
head can be used to aid in the selection of appropriate 
parameter settings. 


7.2 Performance Compared to LRU 


The primary goal of automatic prefetching is to bring 
necessary file data into the cache before it is needed. 
If automatic prefetching is successful we would expect 
the number of cache misses to be less than the number 
of cache misses experienced under standard LRU cache 
management. 

Figure 7 shows the number of page misses that the 
file system incurred under LRU and under prefetching 
for various cache sizes. After tuning the above parame- 
ters, prefetching performs better than LRU for all cache 
sizes, in some cases outperforming LRU by as much as 
280%. Also note that for the cache sizes shown here, 
prefetching provided the same or better performance 
Figure 7: Cache misses as a function of cache size. 


than LRU using a cache half the size. This is partic- 
ularly important for machines that do not have large 
amounts of memory available for file caching. Even 
for large memory machines, the ability to achieve sim- 
ilar performance using smaller cache sizes results in 
more memory for applications. This also indicates that 
the number of correctly prefetched pages more than 
offsets any pages incorrectly forced out of the cache by 
prefetching, even for small cache sizes. 

For this particular trace, both LRU and prefetching 
realize relatively little improvement in the miss ratios 
for caches larger than 4 MB². However, although LRU 
performance begins to approach prefetch performance 
as cache size increases, simulations out to cache sizes 
of 20 MB still show that prefetching results in an 11% 
reduction in the number of misses as compared to LRU. 


8 Conclusions 


Our results show that reasonable predictions can be 
made based on past file activity. As a result, auto- 
matic prefetching can substantially reduce I/O latency, 
make better use of the available bandwidth via batched 
prefetch requests, and improve cache utilization. As 
wide area distributed file systems, CDROM, RAID, 


²Like the traces reported in [2], this particular trace consisted of 
unrestricted real user usage. However, unlike the traces in [2], this 
trace contained no “heavy users” and thus can achieve reasonable 
miss rates with a 4 MB cache. 


and other high latency/high bandwidth systems become 
prevalent, prefetching will become an increasingly im- 
portant mechanism toward high-performance I/O. 
Operating-System Support for Distributed Multimedia 
Abstract 


Multimedia applications place new demands upon 
processors, networks and operating systems. While 
some network designers, through ATM for example, 
have considered revolutionary approaches to support- 
ing multimedia, the same cannot be said for operating 
systems designers. Most work is evolutionary in na- 
ture, attempting to identify additional features that can 
be added to existing systems to support multimedia. 
Here we describe the Pegasus project’s attempt to build 
an integrated hardware and operating system environ- 
ment from the ground up specifically targeted towards 
multimedia. 


1 Introduction 


Since the invention of electronic computers in the for- 
ties, every decade has been characterized by new ways 
in which they were used. In the fifties, people used 
sign-up sheets to reserve the computer for an hour’s 
work; in the sixties batch processing was introduced; 
time sharing became pervasive in the seventies; the PC 
and networking came in the eighties; and now, in the 
nineties, we see the introduction of multimedia. 

These days, every self-respecting computer vendor 
sells computers with some form of multimedia support. 
Some workstations now have cameras built into them, 
PCs come with multimedia applications, even game 
computers now make use of CD-I. From a research 
viewpoint, multimedia seems to be a solved problem; 
can’t we see the wonderful demonstrations from every 
vendor? 

We argue that the multimedia applications on most 
systems today are inflexible: they more or less take 
over the machine and cannot be combined with other 
applications. 

Multimedia, we claim, is only real if the different 
media are treated with equal respect. Audio and video 
should not be second-class media on which the only op- 
erations are capture, storage and rendering, but media 
that can be processed — analysed, filtered, modified — 
just like text and data. This processing should not be a 
privilege of dedicated operating-system processes, but 
should be possible to do, interactively, with ordinary 
applications. 

Existing multimedia systems do not have this ability. 
For example, on typical PC platforms, multimedia ap- 
plications run in real time but take over the machine; on 
Unix platforms, multimedia applications co-exist with 
other applications, but they hardly run in real time. 
Sometimes, dedicated hardware can capture and ren- 
der multimedia in real time, but the data is far removed 
from the processor so that no processing is possible. 

The value of audio and video depends critically on 
the ability both to process and to render them in real 
time. This is hard. The value of interactive audio and 
video additionally depends on being able to capture, 
process and render it with fraction-of-a-second end-to- 
end latency. This is even harder. 


In the Pegasus project, groups at the University of 
Cambridge Computer Laboratory and the University 
of Twente Faculty of Computer Science are rising to 
the challenge of providing architectural and operating- 
system support for distributed multimedia applications. 

Pegasus is a European Communities’ Esprit¹ 
project which is now halfway through its three-year 
funding period. 

The goal of Pegasus is to create the architecture for 
a general-purpose distributed multimedia system and 


¹The Pegasus Project is supported by ESPRIT BRA project 6586 
and partially supported by the Cambridge Olivetti Research Labora- 
tory and a grant from Digital Equipment Corporation. 
Figure 1: Architecture of the multimedia workstation 


build an operating system for it that supports multime- 
dia applications. A few specific applications will be 
implemented in order to prove the practicality of the 
system. 


The architecture consists of: multimedia worksta- 
tions; general-purpose and special-purpose multime- 
dia processing servers; a single storage service for all 
types of data; and Unix boxes as the platform for the 
non-real-time control part of multimedia applications 
and applications unrelated to multimedia. All of the 
components are connected through an ATM network, 
which provides the bandwidth and can provide latency 
guarantees for interactive multimedia data. Multime- 
dia capture and rendering devices are connected di- 
rectly to this network, rather than being connected to, 
for example, workstation buses. This architecture is 
explained in Section 2. 


The operating system support in Pegasus consists of 
a microkernel, named Nemesis, that supports a single 
address space with multiple protection domains, and 
multiple threads in each domain. There is scheduler 
support for processing multimedia data in real time. 
Nemesis has a minimal operating-system interface; it 
does not — at least, not now — have a Unix inter- 
face. However, processes on Nemesis can be created, 
be controlled by, and communicate with, processes on 
Unix. We expect multimedia applications to consist 
of symbiotic processes on Nemesis and Unix, where 
user interface and application control will be provided 


by the Unix part, and real-time multimedia processing 
by the Nemesis part. Later, perhaps as part of an- 
other project, parts of the Nemesis functionality could 
be ported to a general-purpose operating system, or a 
Unix emulation provided over Nemesis. Nemesis is 
described in Section 3. 


System services are viewed as objects: abstract data 
types accessed through their methods. When invoker 
and object share a protection domain, method invo- 
cation is through procedure call; when they share a 
machine, and thus an address space, invocation takes 
place through a protected call, or ‘local remote pro- 
cedure call’; when they are on different machines, in- 
vocation goes via remote procedure call. Objects are 
located using a distributed name service. The name 
space is global only in the sense that every entity, in 
principle, can name any object in the universe; it is not 
global in the sense that there is one root to the name 
space, or that one name identifies the same object any- 
where. Each protection domain contains a local name 
server which maintains connections with name servers 
elsewhere. The name server assists in establishing the 
appropriate channels through which local and remote 
objects are invoked. The name server is described in 
Section 4. 
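
The choice of invocation mechanism described above depends only on where the invoker and the object live. A minimal sketch of that decision, using hypothetical names (`Location`, `invocation_mechanism`) that are not part of Nemesis, might look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    machine: str  # machine on which the invoker or object resides
    domain: str   # protection domain within that machine


def invocation_mechanism(invoker: Location, obj: Location) -> str:
    """Pick the invocation mechanism from the relative placement of both parties."""
    if invoker.machine != obj.machine:
        return "remote procedure call"
    if invoker.domain != obj.domain:
        # Same machine, hence the same address space, but another domain.
        return "protected call"
    return "procedure call"
```

In such a scheme the caller sees the same method interface in all three cases; only the channel set up with the help of the name server differs.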


The Pegasus File Server is a log-structured file ser- 
vice designed to store and retrieve multimedia files in 
real time and to scale to a very large size. Scaling the 
file-server design up to terabyte capacity has forced 
Figure 2: Principle of the ATM Camera 


us to redesign the log-structured file-system structures 
as they occur in Sprite or BSD4.4. The Pegasus File 
Server uses a buffering and storage strategy that pre- 
vents loss of data in case of failure of a single com- 
ponent. The Pegasus File Service is described in Sec- 
tion 5. 


2 Systems Architecture 


In this section, we will show and explain the unusual 
architecture of the Pegasus system. The system con- 
sists of workstations and servers, interconnected by an 
ATM network. We use an ATM network as it can pro- 
vide high bandwidth and low latency. ATM networks 
can scale gracefully to large sizes and link bandwidths 
and very large aggregate bandwidths. 

Multimedia systems need special hardware for in- 
put and output of digital audio and video. Once digi- 
tized, video and audio streams must be transported to 
where they are processed, stored or rendered. Video 
requires substantial, but not staggering bandwidths: 
using frame-by-frame compression, for instance with 
JPEG, a video stream requires no more than a megabyte 
per second. Modern networks can easily provide this 
bandwidth. Using compression methods that compress 
groups of frames, such as MPEG, much higher com- 
pression can be reached, albeit at the cost of higher end- 
to-end latency. Audio has modest bandwidth require- 
ments compared to video, but is much more susceptible 
to jitter, that is, the irregularities in the transport and 
processing times. 

For smooth and efficient handling of interactive dig- 
ital audio and video, the paths between origin and 
destination must be as short as possible. Gratuitous 
processing and transportation increase the end-to-end 
latency and hence decrease the quality. Thus, it is 
desirable that audio and video data are not handled 
by operating-system and application code except when 
application-specific processing is being carried out. 

Figure 1 shows an important aspect of the Pegasus 
architecture — the target end-system architecture. The 
figure shows a conventional workstation and its network 
interface connected to an ATM switch. However, 
also connected to the switch we see a camera device, 
a display device, an audio device, and then the rest 
of the ATM network. The important point is that the 
switch is under control of the workstation; that is, all 
connections through the switch are managed by the 
workstation, so that the workstation is also in control 
of the multimedia devices. 

This setup is much like that of the Desk-Area Network 
(Hayter and McAuley [1991]). However, in a real 
DAN, an ATM switch fabric actually forms the central 
backbone of the workstation itself; CPU, memory and 
devices all communicate via the switch. The Pegasus 
project, partly because of its time frame of only three 
years, uses a conventional bus-based architecture for 
its processor devices, but uses the DAN mechanism for 
connecting multimedia devices. 

In this architecture, when video flows from a camera 
in one system to a display in another — as is the case 
in video-phone and video-conferencing applications — 
no processors need to process any video data. This goes 
for the audio data too, of course. Hence the processors 
in the workstations, at both the camera and display, 
only need to manage the connections and devices. 


2.1 Some ATM Devices 


This section briefly describes the ATM devices used 
by the Pegasus project to provide a multimedia plat- 
form. More details of the DAN devices are available 
in Barham et al. [1994]. 

The ATM camera (Pratt [1993]) directly produces 
digital video as a stream of ATM cells. The principle of 
the ATM camera is schematically depicted in Figure 2. 
Scan-lines of video are digitized and when eight lines 
have been buffered, they are encoded as tiles, rectangles 
of 8 x 8 pixels. A number of tiles are packed into the 
payload of an AAL5 frame together with a trailer that 
provides the x and y coordinates of the tiles with respect 
to the video frame, and a time stamp that identifies the 
frame that the tile belongs to. 
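
The tiling step can be sketched as follows. This is an illustration only: the function name and the dictionary layout are hypothetical, pixels are plain integers, and the packing of tiles into AAL5 frames is not modelled.

```python
def tiles_from_band(band, band_y, frame_stamp):
    """Split a buffered band of eight scan lines into 8 x 8 pixel tiles.

    `band` is a list of eight equal-length rows of pixel values, `band_y` is
    the y coordinate of the band's first line within the video frame, and
    `frame_stamp` identifies the frame the tiles belong to.
    """
    if len(band) != 8:
        raise ValueError("a band must contain exactly eight scan lines")
    width = len(band[0])
    tiles = []
    # Any partial tile at the right-hand edge is ignored in this sketch.
    for x in range(0, width - width % 8, 8):
        pixels = [row[x:x + 8] for row in band]
        tiles.append({"x": x, "y": band_y, "frame": frame_stamp, "pixels": pixels})
    return tiles
```

Each tile carries its own coordinates and frame identifier, which is what allows a receiver to place tiles independently of one another.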
Figure 3: Architecture of the ATM display 


Cameras can be equipped with one or more com- 
pression devices. The device to be used is identified 
when the virtual circuit is established. Currently, both 
raw video and motion JPEG are supported. Using 
AAL5 allows interaction with standard AAL5 implementations 
and offers protection against rendering or 
decompressing faulty tiles. 

The version of the ATM camera now in production 
also includes audio capture capability. 

The ATM display, shown in Figure 3, implements a 
single primitive, that of displaying arriving pixel tiles 
on incoming virtual circuits to windows on the screen. 
The virtual-circuit identifier (VCI) is used as an index 
into a table of window descriptors; each window de- 
scriptor has an x and y offset from the top-left-hand 
corner of the display, and clipping information. By 
manipulation of these contexts, a window manager can 
control which virtual channel, and thus which process, 
can access the different pixels of the screen. 
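
The display primitive can be illustrated with a short sketch. The descriptor layout is an assumption made for the example: the text only says that a descriptor holds x and y offsets and clipping information, so the clip rectangle format and all names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowDescriptor:
    x_off: int   # window offset from the top-left-hand corner of the display
    y_off: int
    clip: tuple  # (left, top, right, bottom) in screen coordinates; right/bottom exclusive


def place_tile(descriptors, vci, tile_x, tile_y, size=8):
    """Return the screen pixels that a tile arriving on `vci` is allowed to write."""
    d = descriptors.get(vci)  # the VCI indexes the table of window descriptors
    if d is None:
        return []  # no window is associated with this virtual circuit
    left, top, right, bottom = d.clip
    visible = []
    for dy in range(size):
        for dx in range(size):
            sx = d.x_off + tile_x + dx
            sy = d.y_off + tile_y + dy
            if left <= sx < right and top <= sy < bottom:
                visible.append((sx, sy))
    return visible
```

A window manager in this model never touches pixel data; it changes what a virtual circuit may draw simply by editing the entry in `descriptors`.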

Incoming data can be coded as compressed or un- 
compressed tiles. Note that as tiles essentially repre- 
sent bit-blit operations of fixed size, from the viewpoint 
of a display, there is a unification of video and graphics. 
The code in conventional window systems that does 
the multiplexing of windows to the display can largely 
disappear; the multiplexing is done via the display’s 
window descriptors. The window manager, exerting 
its control over the creation and modification of these 
descriptors, can create windows on screen, move them, 
resize them, iconize them and raise or lower them. It 
can also use a window descriptor that allows it to write 
the whole screen for decorating windows with title bars 
and resize buttons. 


While the hardware for the display is under develop- 
ment, software emulation using a DS5000-25 is being 
used. 

Finally, there is an ATM DSP node which combines 
digital signal processing and audio input and output. 
This device contains DACs and ADCs and packs and 
unpacks audio samples into ATM cells. Each such cell 
also contains a time stamp. 

Our experience so far indicates that ATM devices are 
simple to construct and that they allow a natural com- 
bination of video data and graphic data on a display. 
The use of tiles for video reduces latency in several 
places from a ‘frame time’ (33 or 40 ms) to a ‘tile time’ 
(30 to 40 µs). Since latencies tend to add up, this is an 
important reduction. 


2.2 Control Protocol 


Multimedia devices generate two streams of data on 
two distinct virtual circuits. One is the actual data 
stream which was cursorily described above. The 
other is a control stream; this is a bi-directional low- 
bandwidth stream that is used to control the device and 
for purposes of synchronization. 

Both data and control virtual circuits are established 
through the normal mechanism of ATM signalling, al- 
though in the case of many of the ATM devices, this 
signalling is handled by a management process on the 
attached workstation, rather than by the device itself. 

Typically, the device manager will connect the data 
stream directly to the sink or source; however, the control 
stream would normally be connected to a local synchronization 
process. For example, a host that wishes 
to send synchronized audio and video will do so by 
having the audio node and camera send the audio and 
video data streams separately (they have to end up in 
different devices too, at the other end), while a local 
process will merge the two control streams into a com- 
bined control stream for the playback control process 
at the rendering end. The playback control process is 
then responsible for the synchronization of the play- 
out of the various streams arriving at it, based on the 
source synchronization information from the remote 
manager(s) and data arrival events. 

The Pegasus File Server, which can also be viewed 
as a multimedia device in this context, uses the con- 
trol stream associated with an incoming data stream 
to generate index information that can later be used to 
go to specific time offsets into a media file or a set of 
synchronized files. 


2.3 Systems Components 


An overview of the Pegasus architecture is shown in 
Figure 4. In this figure, we can distinguish a Pegasus 
multimedia workstation, multimedia compute server, 
storage server and Unix server, all interconnected by 
an ATM network. 

Each site is using locally developed ATM switches 
to provide the ATM network: the Fairisle switch in 
Cambridge (Leslie and McAuley [1991]), and the Rattlesnake 
switch in Twente (Smit [1994]). 

The architecture of the multimedia workstation is as 
described above; multimedia input and output devices 
are connected to a local ATM switch (for which we 
use the Fairisle switch) and the rest of the workstation 
is entirely conventional. The multimedia processing 
nodes do not have special devices attached to them. 

The multimedia workstations and processor nodes 
are controlled by a microkernel, called Nemesis. This 
kernel, which is discussed in more detail in Section 3, 
provides support for multimedia applications: timely 
scheduling and efficient interprocess communication. 

One or more nodes in Pegasus run Unix. Applica- 
tions on this platform have access to a rich collection of 
tools (compilers, text processors, graphics support, 
etc.) which, given the limited effort available, we do not intend 
to make available on the Nemesis kernel. We expect 
many multimedia applications to be split over Unix and 
Nemesis; the Unix part will contain the control func- 
tionality, whereas the Nemesis part will contain the 
necessary real-time functionality for audio and video 
processing. 

This separation is entirely inspired by practical con- 
siderations. The Pegasus design team does not have 
the resources to add the kind of scheduling necessary 
for multimedia processing to existing operating-system 
platforms; they are too big to modify². Separating 
Nemesis and Unix gives us the best of both worlds: a 
testable and measurable platform for multimedia ap- 
plications and all the functionality of Unix. It is for 
another project to port Nemesis functionality to Unix 
or vice versa. 


3 Kernel Support 


The Nemesis kernel implements several unusual fea- 
tures, some of which are present to aid in the implemen- 
tation of multimedia applications, others for the simple 
reasons of efficiency and tidiness. Here we summarize 
the major features. 


3.1 Memory Model 


A Nemesis kernel provides a number of distinct, 
schedulable entities, called domains. While all do- 
mains share the same virtual address space, privacy 
and protection are implemented using the appropriate 
access rights in the virtual address translations. Code 
executing within a domain may access memory within 
another domain only if both domains have explicitly 
arranged to share the memory. 

Some examples highlight the approach: shared li- 
brary segments would be mapped readable in every 
domain; a unidirectional inter-domain communications 
channel would be mapped read/write in the source and 
read-only at the sink; objects may be shared in shared 
read/write segments; etc. 

The cost of using a single address space is the penalty 
of load-time relocation. We try to amortise this cost 
by caching the results of such relocations and then aim 
to reload an application at the same virtual address at 
which it was last executed. In this we are helped by the 
use of 64-bit VM architectures, which allow a sparse 
allocation of addresses so that we can arrange reuse 
with high probability. Consider for example allocating 
the top 32 address bits of a 64 bit virtual address based 
on a 32-bit hash function of the code to be executed. 
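
As an illustration of this idea only, the preferred load address could be derived as below. The text does not name a hash function, so `zlib.crc32` is used here purely as a stand-in 32-bit hash, and the function name is hypothetical.

```python
import zlib


def preferred_load_address(code: bytes) -> int:
    """Derive a 64-bit virtual address whose top 32 bits are a hash of the code."""
    top_bits = zlib.crc32(code) & 0xFFFFFFFF  # stand-in 32-bit hash (assumption)
    return top_bits << 32                     # low 32 bits are left as offset space
```

Because the address depends only on the code image, the same application tends to be placed at the same address on each load, so cached relocation results can be reused unless two images collide on the hash.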

The benefits of a single address space we are aiming 
for are: simplified sharing of data structures (in par- 
ticular objects) between domains, and the removal of 
virtual address aliases which can result in significant 
context switch costs with caches accessed by virtual 
address. 


3.2 Virtual-Processor Model 


A domain differs from the normal concept of a user 
process in the way in which the processor is presented 


2 Yes, even Mach 3.0. 
to it. In the case of a process, the processor is taken
away from it by suspending it and is returned by resuming
the process in exactly the state in which it was
when it was suspended. This gives the process the illusion
that it is running on its own virtual processor;
it also hides from the process any information about
the current processor availability: the process has no
way of knowing when it has the processor.

In Nemesis, the processor is taken away from a do- 
main by deactivating it; deactivation involves storing 
the state of the processor into a data structure shared by 
the kernel and domain, the Domain Information Block. 
When the domain is next scheduled, the processor is
given back to it by activating the domain; activation
involves transferring execution to an address specified 
in the activation vector entry in the Domain Informa- 
tion Block. 

For a domain supporting a traditional single- 
threaded model of execution, activation start up code 
would just restore the saved context and the user code 
would continue to execute. Another common use 
within the Pegasus project would be for the entry point 
to be a user-level thread scheduler. In this case the 
mechanism provides functionality similar to scheduler 
activations (Anderson et al. [1991]). Finally, some do- 
mains may be completely event driven, for example, 
device driver domains. 
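
A toy simulation may make the activation model more concrete. It is a sketch under stated assumptions, not Nemesis code: the class and function names are hypothetical, and a "context" here is just a Python value rather than saved processor state.

```python
class DomainInfoBlock:
    """Toy stand-in for the structure shared by the kernel and a domain."""
    def __init__(self, activation_vector):
        self.saved_context = None                   # written on deactivation
        self.activation_vector = activation_vector  # entry point chosen by the domain

def deactivate(dib, context):
    dib.saved_context = context           # kernel stores the processor state

def activate(dib):
    return dib.activation_vector(dib)     # kernel transfers to the entry point

def resume_saved(dib):
    """Entry point for a single-threaded domain: restore the saved context."""
    return ("resume", dib.saved_context)

def make_thread_scheduler(ready_threads):
    """Entry point for a multi-threaded domain: a user-level scheduler."""
    def entry(dib):
        ready_threads.append(dib.saved_context)   # preempted thread is queued last
        return ("run", ready_threads.pop(0))      # oldest ready thread runs next
    return entry
```

With resume_saved as the activation vector the domain behaves like an ordinary process; with make_thread_scheduler the same kernel mechanism hands the decision about what to run next to the domain itself.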

Hence it is simple to support the standard program- 
ming models on this activation model; in fact all op- 
erating systems do it, but it is usually the case that 
the asynchronous nature of interrupts and rescheduling 
events is hidden from the user level code. 


The Nemesis mechanism provides a number of ad- 
vantages for the types of multimedia applications we 
are considering. First, it provides a means of informing 
applications when they have the processor; a user-level 
scheduler can use this information, together with the 
current time, to make more informed decisions about 
the fate of the threads which it controls. Second, be- 
cause thread scheduling is performed by the applica- 
tion, the user-level scheduler has direct control over 
the behaviour of its threads, and does not have to resort 
to describing their behaviour to a central scheduler in 
terms of priorities and deadlines. Third, once a do- 
main is given the processor, it keeps it until its time 
quantum expires, or it voluntarily yields the proces- 
sor because it has no more work to do. This avoids 
the problems encountered in kernel level thread im- 
plementations when threads block in the kernel and 
the kernel scheduler gives the processor which was 
running the blocked thread to a thread belonging to an- 
other process. Nemesis has no blocking system calls 
except “suspend” which will typically only be called 
by a domain user-level thread scheduler. 


3.3 Domain Scheduling


To explain the scheduling mechanism adopted in 
Nemesis requires an understanding of how we see a 
flexible multimedia platform being used. The alloca- 
tion of resources to applications will not be controlled 
solely by the applications themselves. Rather, we see 
users being able to control processor allocation much 
in the same way that they control pixel allocation in 
window systems. Thus, applications will not always
get what they want; they will have to adapt to the re- 
sources they are given. However, for a particular time, 
seconds or tens of seconds, some of the resources given 
to an application may be viewed as “guaranteed”. The 
application may choose to use a particular algorithm
on the basis of this guarantee. It may also be able to ex- 
ploit unguaranteed resources which become available 
fortuitously. 

The approach to scheduling in Nemesis is to sched- 
ule domains with a weighted scheduling discipline, 
where the weights are calculated from the user’s cur- 
rent policy. Within a given time frame, not all domains 
may use their allocation; the policy for sharing out 
remaining resources is still the subject of investiga- 
tion. While domains have some processor allocation 
remaining, the current scheduler implementation uses 
an earliest deadline first algorithm to select between 
them. 
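
The two-level selection described above can be sketched as follows. The names are hypothetical, and the way each domain's deadline is obtained is an assumption; the text does not say how deadlines are derived, nor how unused allocation is shared out.

```python
from dataclasses import dataclass

@dataclass
class Domain:
    name: str
    weight: float           # share of each time frame, derived from user policy
    deadline: float         # assumed to be supplied per domain
    remaining: float = 0.0  # processor time left in the current frame

def start_frame(domains, frame_length):
    """Divide one time frame among the domains in proportion to their weights."""
    total = sum(d.weight for d in domains)
    for d in domains:
        d.remaining = frame_length * d.weight / total

def pick_next(domains):
    """Earliest deadline first among domains that still have allocation."""
    eligible = [d for d in domains if d.remaining > 0]
    if not eligible:
        return None   # slack-time policy is left open, as in the text
    return min(eligible, key=lambda d: d.deadline)

domains = [Domain("video", 3, deadline=10.0), Domain("audio", 1, deadline=5.0)]
start_frame(domains, 40)
print(pick_next(domains).name)   # audio
```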

Above this primitive-level scheduler, and running on
a longer time scale, is a Quality-of-Service-manager domain
whose task is to update the scheduler weights. It does
so not only in response to applications entering
or leaving the system, but also adaptively as applications
modify their behaviour. These updates are made
on a longer time scale than the individual scheduling
decisions in order to smooth out short-term variations
in load.


3.4 Events 


Nemesis provides a single mechanism by which do- 
mains can communicate the occurrence of events to 
each other — this also includes indications from in- 
terrupt handlers. A domain is eligible for scheduling 
when it has pending events, at which point it is included 
in the scheduling mechanism described above. Then, 
when a domain is activated, it is informed of pending 
events. 

Events themselves do not carry values, but merely 
indicate that something has occurred. This may be the 
updating of a shared object, the arrival of a message 
from the network, passage of time, etc.; however, closures
(i.e. methods and data) are associated with each
event and hide this heterogeneity from the event dispatcher.
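
As an illustration of value-free events with attached closures, consider the sketch below. The class names are hypothetical, and representing a pending event as a counter is our assumption; the text only says that events indicate that something has occurred.

```python
class EventChannel:
    def __init__(self, closure):
        self.pending = 0         # events carry no value, only occurrences
        self.closure = closure   # code plus data that knows what the event means

    def signal(self):
        self.pending += 1

class EventDomain:
    def __init__(self):
        self.channels = []

    def runnable(self):
        """A domain is eligible for scheduling while it has pending events."""
        return any(ch.pending for ch in self.channels)

    def on_activation(self):
        """The dispatcher runs closures without knowing what each event means."""
        for ch in self.channels:
            while ch.pending:
                ch.pending -= 1
                ch.closure()
```

The dispatcher loop is identical whether the closure handles a network packet, a timer, or an update to a shared object, which is the heterogeneity-hiding property the text describes.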

The examples of a protocol domain processing arriv- 
ing packets and inter-domain procedure calls highlight 
the need for two types of event signalling: synchronous 
and asynchronous, depending on whether signalling an 
event should cause a domain to voluntarily give up the 
processor to the signalled domain or continue execut- 
ing. In the inter-domain call example, implemented 
using a pair of message queues in shared memory between
the relevant client and server domains and a pair
of events, lowest latency for a client/server interaction 
will be achieved by the client and server implementing 
the synchronous form of notification. However, a do- 
main performing demultiplexing of incoming packets 
may be most efficient using the asynchronous means. 


3.5 Kernel Privileged Sections 


Device drivers and other trusted modules need to be 
able to protect themselves against interrupts, have ac- 
cess to privileged instructions, etc., for some part of 
their operation. The code that requires this access is 
often a tiny proportion of the total module; however, 
most operating systems would require that the whole 
module run in kernel mode, whether linked statically or 
dynamically loaded. Furthermore, it becomes a prop- 
erty of the code that it runs in kernel mode, rather than 
the data the code is manipulating. 

Nemesis offers the concept of the Kernel-Privileged 
Section to meet the requirement for a dynamic and 
extensible means to provide access to kernel mode. 
Privileged domains may define sections of their code 
which need to be executed in kernel mode. In a block- 
structured language this would naturally be a basic 
block enclosed with some form of TRY ... FINALLY
construct allowing privileged code to raise exceptions 
but forcing the thread to leave kernel mode before 
any handler outside the privileged section is invoked 
(see Figure 5). The implementation of the Kernel- 
Privileged Section (i.e. the begin_KPS and end_KPS) 
is highly processor dependent — on 68k, MIPS and 
ARM processors it leads to various traps implemented 
in a non-procedural manner, while the aim on the Al- 
pha is to implement a PAL instruction to achieve the 
desired effect. 

In many ways the Kernel-Privileged Section idea 
is akin to using locked critical sections for concurrency
control, whereas most other operating systems have a 
model of kernel mode access more akin to monitored 
procedures. 


3.6 Nemesis State 


A primitive form of the Nemesis kernel, Nematode, 
has been implemented on DECstation 5000 (Hyden 
[1994]); this provides domains, events, and scheduling 
support. Currently Nematode is being evolved to con- 
form to the machine independent interfaces defined for 
the Nemesis kernel. 

The VM model and communications abstraction are 
adopted from those used for Wanda (Dixon [1991]); 
migration of this code awaits completion of the Neme- 
sis kernel. 
    ... <unprivileged code>

    begin_KPS();                /* enter privileged section */
    TRY
        ... <privileged code>
    FINALLY
        end_KPS();              /* leave privileged section */
    END;

    ... <unprivileged code>


Figure 5: Coding a Kernel-Privileged Section


4 Naming and Invocation 


Most objects (entities, things) will be used locally. 
Therefore, most names of objects used will be names 
of local objects. Name resolution should, therefore, be 
most efficient for local names. This implies that local 
names should be shortest and suggests that names of 
local objects should normally be near to the root of the 
naming tree. 

This, it must be clear, is a deviation from a trend 
towards using global name spaces. In a singly rooted 
global name space, the shortest path names refer to 
countries or organizations; it is rare that we wish to 
name those by themselves. The most widely claimed 
advantage of a global name space is that objects have 
the same name anywhere and that this facilitates shar- 
ing. What actually facilitates sharing much more is 
the proper use of naming conventions: One can often 
guess somebody’s electronic-mail address, one looks 
for TeX macro files in subdirectories of /usr/local/lib or 
/usr/lib, one gives C source code files a ‘.c’ extension.
If the conventions are disobeyed, programs fail. 

By using naming conventions properly, one can cre- 
ate name spaces that are only global in the sense that 
any object anywhere can be named, but not necessarily 
by the same name everywhere. The root of the nam- 
ing tree can be the most local object and longer path 
names generally name objects further away. Conven- 
tions must be used to allow object sharing and there is 
no reason why one convention could not be the use of 
a subtree named /global for global names. 

This sort of naming is used in Plan 9 from Bell Labs. 
Pike et al. [1993] have already put forward some of the 
arguments for naming conventions being more impor- 
tant than global name spaces. Our naming mechanisms 
have been heavily inspired by those of Plan 9 as shall 
become clear. 

Every process starts up with a built-in name space.
Usually, this name space is inherited from a parent
process and is at least partly shared with other name 
spaces. The name space consists of a local name 
space which names objects local to the process, and 
mounted name spaces which name objects external to 
the process. The mount point of a mounted name space 
is a local object with a connection to a name space in 
another process. Name resolution in mounted name 
spaces takes place by making name-lookup requests 
through the connection to the other process. The result 
of this resolution is an object handle. 

Using an object handle, objects can be accessed 
through their methods. The precise manner in which 
methods are invoked depends upon the “domain rela- 
tion” between invoker and object. If they share a pro- 
tection domain then the invocation is a procedure call; 
when they are in the same address space but different 
protection domains (for example on the same Neme- 
sis machine) invocation is by protected call; and when 
in different address spaces invocation is performed by 
remote procedure call. 
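
The three cases can be written down as a small decision function. This is a sketch only; the dictionary keys and example names are hypothetical.

```python
def invocation_kind(invoker, target):
    """Pick the call mechanism from where the invoker and the object live."""
    if invoker["address_space"] != target["address_space"]:
        return "remote procedure call"
    if invoker["protection_domain"] != target["protection_domain"]:
        return "protected call"
    return "procedure call"

client = {"address_space": "machine-a", "protection_domain": "editor"}
driver = {"address_space": "machine-a", "protection_domain": "video-driver"}
print(invocation_kind(client, driver))   # protected call
```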

When making an invocation there is always code at 
the invoker’s end that depends on the call interface. 
In the case of a local procedure call, this interface- 
dependent code is generated by the compiler. In the 
case of system calls it is loaded from a library and in 
the case of remote procedure call it is generated by a 
stub compiler and linked with the rest of the caller’s 
code. 

Client stubs for far-away objects may do more than 
just transport call parameters to the remote objects; they 
may, for instance, perform caching so that there is no 
longer a one-to-one mapping between client calls to the 
stubs and calls to the remote objects. Such intelligent 
stubs are referred to as agents or clerks. 

When objects can migrate, for instance, to where 
they are accessed, the interfaces to them may change. 
This means that the interface with which calls are to 
be made is not always known a priori; the calling
code depends on where the object is found when it is 
invoked. 

Early distributed systems solved this by using the 
most general invocation method always: remote pro- 
cedure call. This is not an optimal solution, especially 
now that dynamic linking can be used to invoke opti- 
mal code for the kind of call to be made in the case at 
hand. 

An object-naming mechanism can be used to make 
the mechanism whereby object-interface code is loaded 
transparent. In our model, the resolution of the name 
of an object results in a handle. This handle is essen- 
tially a pointer to the interface to the object. For our 
handles we use maillons (Maisonneuve, Shapiro and 
Collet [1992]), which consist of an opaque, fixed-size, 
object reference and a pointer to a function that re- 
turns the address of the interface when called with the 
reference as argument. The extra level of indirection 
provided by the maillon allows connections to objects 
to be set up, or objects to be fetched before their first 
invocation, but in the most common case — the object 
is already there and ready to be invoked — the maillon 
imposes very little overhead. 
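
The shape of a maillon, as described above, can be sketched like this. It is not the implementation of Maisonneuve, Shapiro and Collet; the class, the locator functions and the object table are hypothetical, and interfaces are modelled as plain dictionaries of functions.

```python
class Maillon:
    """An opaque fixed-size reference plus a function that finds the interface."""
    def __init__(self, ref: bytes, locate):
        self.ref = ref
        self.locate = locate

    def interface(self):
        return self.locate(self.ref)

local_objects = {b"obj00001": {"read": lambda: "local data"}}

def locate_local(ref):
    """Common case: the object is already present, so this is one lookup."""
    return local_objects[ref]

def make_lazy_locator(connect):
    """Set up the connection (or fetch the object) on first use only."""
    cache = {}
    def locate(ref):
        if ref not in cache:
            cache[ref] = connect(ref)
        return cache[ref]
    return locate

handle = Maillon(b"obj00001", locate_local)
print(handle.interface()["read"]())   # local data
```

The caller uses both kinds of handle in the same way; only the locator function differs, which is where the extra level of indirection pays off.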

Object handles are first-class objects in that they can 
be passed as arguments in local and remote procedures. 
Passing an object handle for a local object to a remote 
process has the side effect of creating a connection 
through which the object can be invoked remotely. 

The Pegasus remote-procedure-call mechanism is 
based on ANSA’s RPC and layered on MSNA 
(McAuley [1989]). The Multi-Service Network Ar- 
chitecture is a protocol hierarchy for ATM networks 
that also caters for continuous-media transport. 


5 Storage 


The storage system in Pegasus is intended to store tra- 
ditional file data as well as multimedia data efficiently. 
A storage service for multimedia data must have a large 
storage capacity (video produces half a megabyte per 
second compressed, so a half-hour video already occu- 
pies a gigabyte) and a guaranteed (fixed) service rate. 

Ordinary data usually occupies less space and does 
not require a guaranteed service rate. The data rate 
does not have to be constant, but should be as high 
as possible. Locality of reference can be exploited by 
caching data in client and/or server memory. Most 
modern file systems demonstrate that caching yields 
substantial performance gains. 

This applies to naming data too, although directories
can be cached more effectively when the semantics
of directory operations are exploited in the caching algorithms.

In contrast, caching video and audio is usually not a 
good idea. If the system can already guarantee the ap- 
propriate rate for a video or audio stream when it is not 
cached, caching it will only use up memory, but cannot 
result in a higher performance — a fixed performance is 
desired. To make matters worse, caching would often 
be counterproductive: Most video sequences and many 
audio sequences are larger than the cache, so, by the 
time a user has seen, or an application has processed, a 
video to the end, the beginning has already been evicted 
from the (LRU) cache. 

Since different kinds of data require different treat- 
ment in our storage service, it was decided to make 
a hierarchical design for it, where a common bottom 
layer is responsible for reading and writing the data on 
secondary and tertiary storage devices and maintain- 
ing the storage structures on them. Above this layer, 
different service stacks can be built using specialized 
algorithms for particular kinds of data. 

These service stacks can be partially or wholly mir- 
rored in file-server agents on client machines. Thus, 
caching strategies, for instance, can be jointly imple- 
mented by corresponding layers of code in client and 
server machines. 

The service stack for continuous data on the server 
has been designed to interact directly with the multi- 
media devices of Pegasus. As described in Section 2, 
continuous streams composed of several substreams 
(synchronized video and audio is a typical example) 
will cause several data streams and one control stream 
to be generated. The storage server stores the data 
streams and uses the control stream to generate index- 
ing information. This information then allows reading 
synchronized streams from a particular point, and fast 
forward, reverse play, etc. of these streams. 

The bottom layer of the Pegasus storage service is 
called the core layer. It manages storage structures on 
secondary and tertiary storage devices and carries out 
the actual I/O. Pegasus uses a log-structured storage 
layout as was exemplified by Sprite LFS (Rosenblum 
and Ousterhout [1991]). 

The log is segmented in megabyte segments. Each 
segment is striped across four disks. A fifth disk is used 
as a parity disk and allows recovery from disk errors. 
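
A minimal sketch of striping with a parity disk follows. The text does not say how the parity is computed or how a segment is divided, so byte-wise XOR parity and contiguous quarter-segment stripes are assumptions, and the function names are hypothetical.

```python
from functools import reduce

def xor_columns(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def stripe_with_parity(segment: bytes, data_disks: int = 4):
    """Split a segment into one stripe per data disk, plus a parity stripe."""
    assert len(segment) % data_disks == 0
    size = len(segment) // data_disks
    stripes = [segment[i * size:(i + 1) * size] for i in range(data_disks)]
    return stripes, xor_columns(stripes)

def rebuild(stripes, parity, lost: int):
    """Recover the stripe of one failed data disk from the survivors and parity."""
    survivors = [s for i, s in enumerate(stripes) if i != lost]
    return xor_columns(survivors + [parity])

stripes, parity = stripe_with_parity(bytes(range(16)))
assert rebuild(stripes, parity, lost=2) == stripes[2]
```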

Normal file data ends up in the log similarly to Sprite 
LFS. Continuous data, however, is collected in sepa- 
rate segments, although their metadata (the inodes or 
pnodes as we call them) are appended to the normal 
log. 

The speeds of modern disks are such that the over- 
head of seeks between reading and writing whole seg- 
ments is less than ten per cent, so that a transfer rate of at 
least five megabytes per second per disk is possible on 
high-performance disk hardware. Striping over four
disks makes a total bandwidth of 20 MB per second 
possible. We have not been able to test this yet, since 
our ATM network runs only at a mere 100 megabits 
per second, just over 10 MB per second. 

Partly as a consequence of storing multimedia data, 
we have to expect that our storage service will grow 
large. We have set ourselves the goal to make the 
storage-service algorithms scale to a system size of 10 
terabytes. Cleaning³ algorithms for a storage service
of this size have to be designed carefully. If any part of 
the cleaning process scales with, say, the square of the 
system size, cleaning a terabyte file system will take a 
very long time. 

We are currently implementing a cleaning algorithm 
whose complexity only depends on the number of seg- 
ments to be cleaned and the amount of ‘garbage’. 
Roughly, it works as follows. During normal oper- 
ation of the file system, the core maintains a garbage 
file. Every time a client write or delete operation cre- 
ates garbage, an entry describing the hole in the log 
that corresponds to the obsolete data is appended to the 
garbage file. 

When the file system needs to be cleaned, the 
garbage file is read and its entries are sorted by seg- 
ment number. Then, a single pass through the garbage 
file is needed to find and clean all segments containing 
garbage. When cleaning is complete, the garbage file 
is truncated to a single entry describing the old garbage 
file itself. 

Allowing client operations to continue during clean- 
ing does not complicate the cleaning algorithm. At 
the start of a cleaning operation, the current place in 
the garbage file must be marked and cleaning uses only 
information before the marker while new garbage is ap- 
pended after it. When cleaning is complete, the portion 
of the garbage file before the marker is deleted. 
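
The garbage-file scheme can be sketched in memory as follows. This is an illustration, not the Pegasus core: the entry layout (segment, offset, length) and all names are hypothetical, the garbage file is a Python list rather than a file in the log, and the final entry describing the old garbage file itself is omitted.

```python
from itertools import groupby

class GarbageFile:
    def __init__(self):
        self.entries = []   # (segment_number, offset, length) holes in the log

    def record_hole(self, segment, offset, length):
        self.entries.append((segment, offset, length))

def clean(garbage, clean_segment):
    marker = len(garbage.entries)              # later entries arrive during cleaning
    batch = sorted(garbage.entries[:marker])   # sorted by segment number first
    for segment, holes in groupby(batch, key=lambda e: e[0]):
        clean_segment(segment, [(off, length) for _, off, length in holes])
    del garbage.entries[:marker]               # keep whatever was appended meanwhile
```

The work done is proportional to the number of garbage entries and the segments they touch, apart from the sort, and never to the total size of the store.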

The first prototype of the core of the Pegasus file 
server now runs, with an incomplete cleaning mecha- 
nism. Higher-level services are being added; a Unix 
v-node interface is installed which allows the storage 
system to be used as a Unix file system. 

Since files are stored on RAID, recovery from disk 
failures is straightforward. Once files have reached the 
disk, it is unlikely that they will be lost in a crash. Files,
therefore, should be put on disk as soon as possible after 
they are written by the application. However, from a 
performance viewpoint, disk writes should be delayed 
so that overwrite operations and delete operations can 
be exploited to save disk operations. In the Pegasus 


3 Cleaning in a log-structured filing system is the act of recovering 
space which holds out of date information. Information may become 
out of date either because a later copy has been written or it has been 
logically deleted. 


storage service we have tried to get the best of both 
worlds. 

For this, we make use of the assumption that client 
and server machines crash independently. When an 
application makes a write operation, the client agent 
sends the data to the server and keeps a copy of the 
data in its buffers. When the server receives the data, 
it acknowledges this to the client agent which, in turn, 
unblocks the application. The data is now safe un- 
der single-point failures: when the server crashes, the 
client agent notices and either writes the data to an al- 
ternative server or waits for the crashed server to come 
back up; when the client machine crashes, the server 
will complete the write operation. 
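
A toy model of this write path is given below. All names are hypothetical, and the flushed notification, by which the client agent learns that it may discard its copy, is our assumption; the text does not say when the copy is released.

```python
class Server:
    def __init__(self):
        self.buffers = {}   # volatile memory, written to the log later

    def receive(self, key, data):
        self.buffers[key] = data
        return "ack"

class ClientAgent:
    def __init__(self, server):
        self.server = server
        self.unflushed = {}   # copies kept while the data is not yet on disk

    def write(self, key, data):
        self.unflushed[key] = data
        return self.server.receive(key, data) == "ack"   # application unblocks

    def server_crashed(self, replacement):
        """Replay buffered writes to an alternative (or restarted) server."""
        self.server = replacement
        for key, data in self.unflushed.items():
            replacement.receive(key, data)

    def flushed(self, key):
        """Assumed notification that the server has written the data to disk."""
        self.unflushed.pop(key, None)
```

After write returns, the data exists in two machines' memories, so either one may crash without losing it.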

When there is a power failure, client and server will 
crash together. To guard against this, the servers can 
either be equipped with battery-backed-up memory, or 
with an uninterruptible power supply. With the latter, 
when a power failure occurs, the server has time to 
write its volatile-memory buffers to disk and halt. 

These mechanisms obviate the need for writing data 
to disk quickly. For normal file traffic, this is not only 
beneficial for write performance — Baker et al. [1991] 
showed that 70% of files are deleted or overwritten 
within 30 seconds — but also for cleaning performance: 
The data that does eventually get written to the log is 
reasonably stable, so garbage is created at a much lower 
rate. 


6 Conclusions 


The Pegasus project reflects our belief that if distributed 
multimedia is to be supported effectively, a holistic ap- 
proach to system design is required. Multimedia is not 
just a bolt on; it requires a fundamental reexamination 
of most aspects of the infrastructure. We have thought 
carefully about integrating multimedia devices into the 
network architecture of the system, we have looked 
at the data paths from camera lens to display screens, 
and we have analysed storage infrastructures from a 
performance, reliability and consistency perspective. 

Thus far we have found that this approach gives a 
clean system design and makes our implementations 
efficient and simple. The desk-area network as the 
connecting infrastructure for machines and devices has 
greatly simplified the architecture of the rest of the 
system. 

In the storage service, we have discovered that tech- 
niques for consistent caching, data buffering, log struc- 
ture and RAID, each of which, by itself, is difficult 
to integrate in an existing environment, can be com- 
bined in a new storage system architecture. Consistent 
caching, buffering and RAID give us reliability (no
data loss in a single crash); log structure and RAID
give us good write performance. 

Pegasus is only half-way through its funding period 
now and a lot of work still needs to be done. We hope 
we can demonstrate a complete system in two years’ 
time. The results of our project are naturally public 
and we intend to make all code available where it is not 
restricted by licences from others. 
Abstract


The Whitehead Institute/MIT Center for Genome 
Research is responsible for a number of large genome 
mapping efforts, the scale of which creates problems
of data and workflow management that dictate 
reliance on computer support. Two years ago, when 
we started to design the informatics support for the 
laboratory, we realized that the fluid and ever- 
changing nature of the experimental protocols 
precluded any effort to create a single monolithic 
piece of software. Instead we designed a system that 
relied on multiple distributed data analysis and 
processing tools knit together by a centralized 
database. The obvious choice of operating system
was UNIX. In order to make this choice palatable to 
the laboratory biologists—who rightly consider it 
their job to do experiments rather than to interact with 
computers, and who have come to expect all software 
to be as intuitive and responsive as the Apple 
Macintoshes on their desks—we designed a system 
that runs automatically and essentially invisibly. 
Whenever it is necessary for the informatics system 
to interact with a member of the laboratory we have 
carefully chosen a user interface paradigm that best 
balances the user’s expectations against the system’s 
capabilities. When possible we have chosen to adapt 
familiar software to our user interface needs rather 
than to write user interfaces from scratch. We’ve 
managed to hide the power of UNIX behind the 
innocuous personal computer-based front ends our 
users know and love, using techniques that should be 
applicable in other environments as well. 


1. Introduction 


The Whitehead Institute/MIT Center for Genome 
Research (WI/MIT CGR) carries out large-scale 
genome mapping projects. A genome map is 
composed of a large number of short DNA sequences 
called “markers” which have been ordered and 
assigned to unique positions on chromosomes
[National Research Council, 1988]. The availability 
of such maps greatly simplifies the task of identifying 
and isolating genes relevant to the understanding of 
development and disease. There are two main 
genome mapping projects at the WI/MIT CGR: the 
creation of a genetic map of the mouse [Dietrich et al. 
1992], and the creation of a physical map of the 
human [Green and Olson, 1990]. We estimate that 
these projects will require the completion of several 
million individual experimental steps. This paper 
describes the design of the WI/MIT CGR informatics 
system and the lessons we learned during the process. 


1.2 Choosing UNIX 


Managing information flow in laboratory projects of 
this scale presents several challenges. The first 
challenge is managing the laboratory data for each 
project. The second is data analysis. The third is 
managing the dissemination of information both 
within and outside the laboratory. In addition there is 
a meta requirement: biomedical research protocols 
are a moving target. Laboratory techniques are 
constantly improving, and major and minor 
adjustments of the experimental protocols occur on a 
regular basis. 


We chose to base our informatics system on 
the UNIX operating system for several reasons. First, 
a large number of UNIX utilities for analyzing 
molecular biology data already exist in the public and 
commercial domains. Second, UNIX is an open 
system that is available on many different platforms 
and is familiar to the academic world. Finally, the 
tool-based philosophy of UNIX [Kernighan and 
Plauger, 1978], with its emphasis on inter-process 
communication, lends itself to a modular design. We 
felt that the use of multiple modular data analysis and 
processing tools instead of a single monolithic piece 
of software would allow us to respond more promptly
to changes in the laboratory protocol. 


However, the choice of UNIX, rather than a PC-based
operating system such as MS-DOS, OS/2 or the
Macintosh OS, presented user interface problems. The
laboratory scientists are familiar with personal 
computers, primarily the Apple Macintosh, and 
expect software to behave as it would on a desktop 
system. We could not reasonably expect biologists in 
the laboratory to master a series of data analysis tools 
running under an unfamiliar operating system. 


1.3 The Laboratory Protocol


A flowchart of the mouse genetic mapping protocol is 
shown in Figure 1. The aim of the protocol is to
obtain small random DNA sequences that contain
simple sequence repeats, such as (CA)n, flanked on
both sides by nonrepetitive sequences. These 
sequences are useful because they are frequently 
“polymorphic” between inbred mouse strains. In 
other words, the length n of the repeat varies between 
two or more strains of mice. Like eye or hair color, 
the length of the repeat is a genetic trait that is 
transmitted from one mouse to its progeny, and like 
other genetic traits, it can be mapped to a particular 
position by performing a series of genetic linkage 
analysis experiments. 


The protocol begins by creating a library of 
mouse DNA sequences. This is done by cutting up 
whole mouse DNA into small pieces and inserting 
them into a self-replicating virus. Each viral clone in 





 1. Screen mouse DNA library for clones carrying repeats
 2. Sequence clones on automated sequencer
 3. Proofread sequences --> reject: sequence bad
 4. Strip off non-mouse (viral) portion of sequence --> reject: mouse portion of sequence too small
 5. Find the simple sequence repeat --> reject: no repeat
 6. Check for duplicates --> reject: duplicate found
 7. Check for highly repetitive sequences --> reject: repetitive sequence found
 8. Check GenBank for similar sequences
 9. Find PCR primers flanking repeat --> reject: no suitable primers
    (perhaps pick different primers)
10. Review analysis and confirm primers
11. Order primers and wait for receipt
12. Screen panel of inbred strains for polymorphisms --> reject: not polymorphic
13. Determine inheritance pattern on mapping cross
    (check errors and revise genotype)
14. Error-check inheritance pattern data
15. Map marker
16. Distribute map


Figure 1: Genetic Mapping Protocol
the library contains a different random piece of
mouse DNA. These clones are individually tested to 
find those that are likely to contain repeats, and each 
likely candidate is sequenced on an automated 
sequencing machine. Next comes a series of sequence 
analysis steps. First, since each clone consists of a 
mixture of viral and mouse DNA, the known viral 
portion of the sequence must be found and 
conceptually stripped off. Next, the sequence is 
scanned to identify the simple sequence repeat, if 
any. After this, the sequence is checked for duplicate 
sequences already present in our database and for the 
presence of highly repeated sequences that are 
present multiple times in the genome. If the sequence 
survives these tests, it is next compared to all entries 
in the GenBank database of published sequences in 
order to determine whether this sequence has ever 
been seen before. Finally, we choose a pair of short 
(about 20 base pair) sequences on either side of the 
simple sequence repeat to serve as “primers” for a 
biochemical technique known as the polymerase 
chain reaction (PCR). PCR allows us to rapidly 
determine the length of the simple sequence repeat 
without tediously recloning and sequencing it. 


After having a biotechnology supply 
company synthesize the primer pairs, we characterize 
the simple sequence repeat further. Using PCR we 
determine the lengths of the simple sequence repeat 
in DNA taken from 12 common inbred mouse strains. 
Those repeats that are not polymorphic, that is, that 
do not vary in size between strains, are discarded. 
Those that are polymorphic are subjected to genetic 
mapping experiments that determine the length of
each of the repeats in the offspring produced by
mating two of the inbred strains. Repeats that are 
close together on a chromosome will tend to remain 
together when inherited — they will appear to be 
“linked” — while those that are further apart or on 
different chromosomes entirely will be inherited 
independently of each other. The genetic mapping 
data is now fed into a program which does the 
number crunching necessary to order each of the 
markers and determine the distance between them. 


The genetic mapping protocol is essentially 
a data pipeline. At any given time approximately 600 
sequences are in the midst of processing. 
Experiments can fail and need to be redone, or 
sequences can be found to be unsuitable and be 
dropped from the pipeline at various steps. While 
managing the protocol certainly requires ingenuity in 
data modeling and data processing, we have found 
one of the most challenging tasks to be integrating the 
Apple Macintosh-based work habits of the laboratory 
with our tool-based philosophy of UNIX. 


2. Informatics System Architecture 


The architecture we have chosen is diagrammed in 
Figure 2. The major features are: 


- A centralized database running on a UNIX 
workstation that stores the experimental results of all 
steps in the mapping protocol. Our database is called 
“MapBase” [Goodman et al. 1993, Goodman 1994] 
and is an object-oriented database written in C++. It 
is a multiconnection client/server database that
supports a simple query and update language and is
accessible over the TCP/IP network. An important
design decision was to make MapBase accessible
only to software tools and to programmers. Our end
users (scientists and laboratory personnel) never
interact with the database directly.


Figure 2: Informatics system architecture


- Data analysis and manipulation clients that 
run automatically on UNIX workstations, often under 
the UNIX cron utility. The system is decoupled so 
that the database and data analysis clients do not have 
to reside on the same computer. Scientists do not 
need to take any action to initiate the data processing, 
but the system keeps them apprised of experimental 
progress by E-mailing status reports. 


- Whenever human interaction with the 
informatics system is necessary, it is done through 
familiar spreadsheet, E-mail, and word-processing 
software running on the same personal computer the 
laboratory personnel use for other aspects of their 
work, the Apple Macintosh. This off-the-shelf 
software provides scientists with comfortable 
interface paradigms for interacting with the 
informatics system. By carefully matching the 
paradigms to the tasks, we have been able to match 
user expectations with the system’s capabilities. 


- The MapBase database uses a text-only 
query and transaction language which was designed 
to be easily machine-parseable. While many of our 
data manipulation clients interact directly with 
MapBase, most of them, particularly the adapted 
Macintosh programs, communicate via intermediary 
perl scripts [Wall and Schwartz, 1991] which handle 
the translation between MapBase and the client. 


2.1 Choice of User Interface Paradigms - Lessons 
We Have Learned 


Our first pass at creating user interfaces between the 
laboratory and the informatics system was disastrous. 
Adopting the conventional approach, we wrote a 
number of specialized applications for the graphical 
input and display of laboratory data. To our chagrin, 
biologists stubbornly kept their primary data in 
various Excel spreadsheets that they had created 
themselves, transferring the data to the informatics 
system in large batches only when absolutely 
necessary. They requested that we make our 
graphical data entry forms more spreadsheet-like, and 
were unhappy when we were unable to reproduce the 
full functionality of Excel. They tired of waiting for 
our interactive database queries to complete and 
developed a habit of E-mailing their data requests to 
the programmers. 


Our second attempt to create user interfaces 
took a different approach. Rather than build user 
interfaces from scratch, we took the software and data 
management techniques the lab members were 
already using and modified them to work with the 
informatics system. In contrast to the first attempt, 
these interfaces won immediate acceptance and are 
still in use today. What follows are specific examples 
of laboratory user interfaces we have created and the 
general lessons we have drawn from this experience. 


2.1.1 The best user interface is no interface at all 


From the laboratory’s point of view, computers are at 
their best when they work invisibly behind the 
scenes. When possible we have made our data 
processing steps invisible and automatic. For 
example, every new DNA sequence that is entered 
into MapBase needs to go through a series of 
software checks and characterizations: the sequence 
has to be checked against other sequences in the 
database to catch duplicates. If it passes this test, the 
sequence is checked against worldwide DNA 
sequence databases to see if it has already been 
mapped. Next the sequence is examined for the 
presence of repetitive elements (sub-sequences 
known to be present multiple times in the genome), 
and so on. Rather than ask a member of the 
laboratory or (heaven forbid) a programmer to initiate 
these tasks, the informatics system does it 
automatically. Every night a cron job queries 
MapBase for new DNA sequences, and feeds any that 
are found through a series of small programs that 
perform each of the characterization steps. These 
programs are, in fact, implemented as a series of 
filters connected together through UNIX pipes. The 
program at the very end of the pipeline gathers up the 
results, feeds them back into the database, and E- 
mails a status report to the scientists in the laboratory. 


The sequence characterization programs are 
a nice illustration of the UNIX toolkit approach. The 
input for each of the programs is a series of 
keyword/value attributes in the format 
KEYWORD=VALUE. Keywords identify the 
sequence and the information collected on it in 
previous steps. Programs in the pipeline extract the 
attributes in which they are interested, pass through 
the rest, and add any information that they wish to 
contribute. For example, the program that determines 
the start and length of the simple sequence repeat 
looks for the following attributes in the data stream 


NAME=MJ100 
SEQUENCE=GATTGACGAGATCACAGTTTGGCACAC 
ACACACACACACCAAGTTGAATTTCCTGG


and adds to it the following (passing through the
name, sequence and any other attributes present): 


REPEAT_START=22 
REPEAT_LENGTH=20 
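
As an illustration only (this is not the laboratory's code), a filter of this kind might look like the following Python sketch. It assumes one KEYWORD=VALUE attribute per line, defines a repeat as the longest run of a repeated two-base unit, and reports a 0-based start; all three are assumptions, so its output need not agree with the values in the example above. The function names are hypothetical.

```python
def parse_record(lines):
    """Parse KEYWORD=VALUE lines into a dict, preserving their order."""
    attrs = {}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if sep:  # silently skip anything that is not KEYWORD=VALUE
            attrs[key] = value
    return attrs


def longest_dinucleotide_repeat(seq, min_units=3):
    """Return (start, length) of the longest run of a repeated 2-base unit.

    start is 0-based; (None, 0) means no run of min_units units was found.
    """
    best_start, best_len = None, 0
    for i in range(len(seq) - 1):
        unit = seq[i:i + 2]
        if unit[0] == unit[1]:
            continue  # "AA" is a homopolymer run, not a dinucleotide repeat
        j = i
        while seq[j:j + 2] == unit:
            j += 2
        if j - i >= 2 * min_units and j - i > best_len:
            best_start, best_len = i, j - i
    return best_start, best_len


def add_repeat(attrs):
    """Filter step: pass every attribute through, add repeat data if found."""
    out = dict(attrs)
    start, length = longest_dinucleotide_repeat(out.get("SEQUENCE", ""))
    if start is not None:
        out["REPEAT_START"] = str(start)
        out["REPEAT_LENGTH"] = str(length)
    return out


def format_record(attrs):
    """Serialise a record back into KEYWORD=VALUE lines."""
    return "\n".join(f"{key}={value}" for key, value in attrs.items())
```

Used as a pipe stage, such a filter would read a record from standard input, call add_repeat, and print format_record of the result, leaving attributes it does not understand untouched for later stages.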


The particular order in which sequence 
processing steps are performed is determined by a 
shell script, which makes it easy to rearrange the 
order in which processing is performed, insert data 
processing modules, or experiment with alternate 
algorithms. 
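
The same idea can be sketched in Python, with an ordered list of steps standing in for the shell script. The two steps and the REJECT keyword below are hypothetical placeholders, not part of the system described here; the point is only that reordering or extending the list changes the processing without touching any step.

```python
def run_pipeline(record, steps):
    """Apply each step in order; stop as soon as a step rejects the record."""
    for step in steps:
        record = step(record)
        if "REJECT" in record:
            break
    return record


def check_length(rec):
    """Hypothetical step: reject sequences shorter than 10 bases."""
    out = dict(rec)
    if len(out.get("SEQUENCE", "")) < 10:
        out["REJECT"] = "sequence too short"
    return out


def count_gc(rec):
    """Hypothetical step: add the number of G and C bases."""
    out = dict(rec)
    seq = out.get("SEQUENCE", "")
    out["GC_COUNT"] = str(seq.count("G") + seq.count("C"))
    return out


# Reorder or extend this list to change the pipeline.
STEPS = [check_length, count_gc]
```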


Another example of invisible processing is 
the program that assembles genetic maps. Every night 
new mapping data is incorporated into the growing 
genome maps by a cron-invoked shell script and the 
newly constructed maps are then E-mailed to the 
scientists. In UNIX style, the data processing steps 
are performed by separate programs. A large C 
program called MAPMAKER [Lander et al. 1987], 
does the actual multipoint genetic linkage analysis. A 
second program interprets the numeric output from 
MAPMAKER and converts it into graphical maps, 
while a third converts the maps into Macintosh PICT 
documents and E-mails them to the scientists. 


2.1.2 For moving large amounts of data use “drop 
folders” 


There are times when it is necessary for data to be fed 
into the informatics system in large chunks. One 
example of this is the entry of raw DNA sequence 
information into the database. Automated DNA 
sequencing machines produce this data in the form of 
Macintosh text files. The challenge is getting these 
files from the Macintosh into MapBase. Rather than 
asking laboratory members to cut and paste these 
files into a specialized data entry program or to use a 
UNIX utility such as ftp, we have designated a “drop 
folder” on a disk that is cross-mounted between the 
UNIX and Macintosh systems. This disk appears as an
NFS volume to the workstations and as an
AppleShare volume on the Macintosh desktops. To
transfer the sequence files to the UNIX system the 
user just drags them into the drop folder. A cron- 
launched perl script checks this folder periodically for 
new files, reformats them, and feeds them into 
MapBase. 
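
A minimal Python sketch of such a drop-folder sweep follows. It assumes that "reformatting" includes converting classic Macintosh carriage-return line endings to UNIX newlines, that processed files are moved to a second directory, and that a caller-supplied load function stands in for the MapBase update; none of these details are taken from the actual script.

```python
import shutil
from pathlib import Path


def mac_to_unix(data: bytes) -> bytes:
    """Convert classic Macintosh (CR) and DOS (CRLF) line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def process_drop_folder(drop_dir, done_dir, load):
    """Feed every regular file in drop_dir to load(name, text), then archive it.

    Returns the names of the files that were processed.
    """
    drop_dir, done_dir = Path(drop_dir), Path(done_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for path in sorted(drop_dir.iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue  # skip subfolders and hidden files
        text = mac_to_unix(path.read_bytes()).decode("ascii", errors="replace")
        load(path.name, text)
        shutil.move(str(path), str(done_dir / path.name))
        processed.append(path.name)
    return processed
```

A periodic job built this way would also need to guard against picking up a file that is still being copied into the folder, which this sketch does not attempt.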


2.1.3 To represent tabular data use a spreadsheet. 


There are situations that require more give and take
between the computer and the scientist than is offered
by either behind-the-scenes processing or one-way
data transfer. For these situations we make extensive
use of Microsoft Excel for data entry, viewing and
manipulation. The spreadsheet has become a familiar
and intuitive user interface, and is in fact the way that
most of the scientists in the laboratory are
accustomed to storing and organizing laboratory
results.


Figure 3: Portion of spreadsheet used in data entry for mouse genetic mapping protocol


We have created a set of external code 
modules (“plug-ins”) that give Excel network access 
to MapBase and other UNIX tools. With the 
appropriate macros, these plug-ins allow custom 
spreadsheets to write, retrieve and update MapBase 
records. In this way a scientist can open up a 
spreadsheet that lists a subset of the sequences and 
the experimental results associated with them. 
Although the data appears to be local to the 
spreadsheet and can be copied, pasted, summarized, 
printed and otherwise manipulated in the usual 
spreadsheet manner, it is actually tied to the 
underlying data structures in MapBase. To add or edit 
data, the scientist makes the appropriate changes in 
the spreadsheet and chooses the “Send to database” 
menu command, which updates MapBase. A
screenshot of one of our spreadsheets in action is
shown in Figure 3. This spreadsheet
is used to enter the lengths of simple sequence repeats 
in various inbred strains. 


In addition to the advantages of familiarity 
and ease of use, this approach has allowed us to 
incorporate data analysis and data integrity checking
tools into the Excel spreadsheets in a simple and
extensible manner. The most frequently used Excel
plug-in simply sends the contents of the spreadsheet 
as tab-delimited text over a TCP/IP socket to a 
waiting perl script. The perl script figures out what’s 
to be done with the spreadsheet data, invokes the 
appropriate data analysis and database tools, adds or 
modifies the text and sends it back over the socket to 
the Excel plug-in which obligingly pastes it back into 
the spreadsheet. Adding new behavior to the 
spreadsheet is often as simple as updating the perl 
script. 
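
The round trip can be sketched in Python as follows. The framing (the client sends the whole sheet and then half-closes the connection) and the processing step (appending a hypothetical ROW_OK column) are assumptions made for the example; the real listener decides what to do with the data and calls out to other tools.

```python
import socketserver


def annotate_sheet(text):
    """Append a ROW_OK column to tab-delimited text (hypothetical processing).

    A data row is marked "ok" when every cell after the first is an integer.
    """
    lines = text.splitlines()
    if not lines:
        return ""
    out = [lines[0] + "\tROW_OK"]  # first line is treated as the header row
    for line in lines[1:]:
        cells = line.split("\t")
        ok = len(cells) > 1 and all(c.strip().isdigit() for c in cells[1:])
        out.append(line + "\t" + ("ok" if ok else "check"))
    return "\n".join(out) + "\n"


class SheetHandler(socketserver.StreamRequestHandler):
    """Read one spreadsheet until the client half-closes, then send the reply."""

    def handle(self):
        text = self.rfile.read().decode("utf-8", errors="replace")
        self.wfile.write(annotate_sheet(text).encode("utf-8"))


def make_server(host="127.0.0.1", port=0):
    """Create (but do not start) a listener; port 0 lets the OS choose a port."""
    return socketserver.TCPServer((host, port), SheetHandler)
```

Because all spreadsheet-specific behaviour lives in annotate_sheet, changing what comes back to the spreadsheet means editing only the server side, which is the property the text describes.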


A good example of this extensibility is our
experience when we decided to add error checking to
the genetic linkage mapping data spreadsheet. A
frequent source of laboratory error is the typographical
mistakes technicians make when they type the simple
sequence repeat length inheritance patterns into the
spreadsheet. Such an error is caught that night when a
full MAPMAKER run is performed and the repeat is
found to be unmappable, but by this time the data has
already been entered into MapBase and the technician
has put the experimental results away. We wished to
catch errors at the point
of data entry, while the experimental results were still
fresh. While it was impractical to invoke a full
MAPMAKER run for each new simple sequence repeat
entered, it was possible to write a quick and dirty
program that roughly maps new repeats and catches 
most errors. By having the perl spreadsheet listener 
invoke this error-checking program, we were able to 
give technicians rough mapping feedback almost 
immediately and to catch the typographical errors 
before the data was entered into MapBase. Best of all, 
we did this without writing any new Macintosh code. 
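
The rough-mapping program itself is not described here. One plausible way to get quick feedback, sketched below in Python purely as an assumption, is to compare the new inheritance pattern with those of markers already mapped: in a backcross, an unlinked marker differs from any given marker in roughly half of the offspring, so a pattern that is far from every known marker is suspect. The genotype codes, the "-" missing-data symbol and the 0.3 threshold are hypothetical.

```python
def recombination_fraction(a, b):
    """Fraction of offspring whose genotypes differ between two markers.

    Offspring with a missing call ("-") in either marker are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def rough_map(new_pattern, known, max_fraction=0.3):
    """Return (nearest_marker, fraction); the marker is None if nothing is linked."""
    if not known:
        return None, 1.0
    name, frac = min(
        ((n, recombination_fraction(new_pattern, p)) for n, p in known.items()),
        key=lambda item: item[1],
    )
    return (name if frac <= max_fraction else None), frac
```

A result of (None, ...) would prompt the technician to recheck the typed pattern while the gel is still at hand; it does not replace the full multipoint analysis.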


2.1.4 For database queries, use an E-mail interface. 


E-mail is a particularly effective paradigm for posing 
database queries. Complex MapBase queries can take 
20 to 30 seconds to complete. While this is not a 
particularly long wait, it is too long for an interactive 
session mediated by a graphical user interface, where 
an immediate response is expected. In contrast, 
scientists are accustomed to using E-mail to query 
each other, and they expect a delay of minutes to 
hours between sending out a request for information 
and receiving a response. Scientists can address ad 
hoc queries to MapBase via familiar E-mail software, 
Microsoft Mail, using a series of graphical forms 
we’ve designed. By filling out checkboxes and text 
fields, users of the system can set the conditions to 
satisfy and select the data fields to retrieve. They then 
send the form to the database and in less than a 
minute their query is answered by return mail. 
Textual data, such as status reports, is returned as 
word processor documents, while graphical 
information, such as maps, is returned in the form of 
Macintosh picture files. Tabular data, as one would 
expect, is returned as spreadsheet files. This E-mail 
system is also used for posing queries to MapBase 
over the Internet using a set of text-only forms. 
Figure 4 (next page) shows a portion of one of our E- 
mail forms. This one is used within the laboratory to 
obtain information about the progress of the genetic 
mapping protocol. 


The E-mail query system is implemented 
using familiar UNIX tools. An alias called 
“genome_database” pipes incoming mail to a perl 
script that determines which query form is being used 
and where the query is coming from. Using this 
information, the perl script reformats the query form 
into a series of keyword/value pairs. This data is then 
passed to another perl script that queries MapBase 
and reformats the results in human-readable form. In 
some cases, when the user wishes to receive data as a
spreadsheet or a picture file, the result text goes
through an additional step in which it is piped through
UNIX tools that reformat it as appropriate (these
tools are written in C or perl). Finally the data is
E-mailed back to the user.

The translation of queries from the graphical
format used by Microsoft Mail to the text format
expected by the UNIX E-mail query system is
accomplished by Microsoft Mail itself. Graphical
forms sent outside the Microsoft Mail system are 
converted into a textual representation using rules 
that can be specified when the forms are designed. 
We specify conversion rules that are easily perl- 
parseable. 
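
The first stage of this mail pipeline can be illustrated with a short Python sketch. The textual form layout is not given here, so the example assumes one "Field: value" line per form element and uses the Subject header to identify the form; the REPLY_TO and FORM keywords are likewise hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr


def mail_to_query(raw_message):
    """Turn a mailed text form into an ordered list of (KEYWORD, VALUE) pairs."""
    msg = message_from_string(raw_message)
    pairs = [
        ("REPLY_TO", parseaddr(msg.get("From", ""))[1]),
        ("FORM", (msg.get("Subject") or "").strip()),
    ]
    body = msg.get_payload()
    if not isinstance(body, str):
        return pairs  # multipart mail is outside the scope of this sketch
    for line in body.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip():
            keyword = field.strip().upper().replace(" ", "_")
            pairs.append((keyword, value.strip()))
    return pairs
```

The resulting pairs are in the same KEYWORD=VALUE spirit as the sequence pipeline, so the script that queries the database does not need to know which mail form produced them.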


2.1.5 When all else fails, do it yourself. 


We did of course encounter a small number of 
situations in which existing Macintosh software did 
not handle the job. The two examples that we 
encountered both involved entry of laboratory image 
data. In one case the solution was to capture the 
image via a video camera and interpret it using 
custom image analysis software. In another case the
solution was to use a digitizing tablet to capture the
positions of data points using a small Macintosh 
program and then to automatically paste the data into 
an Excel spreadsheet. Both these tasks were made 
easier by the use of Apple Events, which allow 
Macintosh programs to exchange data and coordinate 
their activities. In the case of the former task, the 
software that controls the video camera resides on a 
Macintosh while the software that interprets the 
image data resides on a DEC Alpha running OSF/1. To
integrate the two, we wrote a Macintosh daemon 
process that listens on a TCP/IP port for incoming 
messages from the image analysis program and 
forwards them, in the form of Apple Events, to the 
camera control program. 


3. Conclusions


We have found the UNIX toolkit approach to be very
helpful in designing and maintaining our informatics
system. As the laboratory protocol has changed,
we've been able to modify our system by adding or
altering software modules or changing the order in 
which they execute. We have adopted the same 
approach to the design of our user interfaces. Instead 
of building our own interfaces from scratch, we have 
adapted existing software with which our users are 
already comfortable. The result has been a system 
that has allowed data throughput to grow by a factor 
of six (from 50 genetically mapped sequences per 
month to over 300/month) over a period of a year, 
and that enjoys a high level of user satisfaction. 


Our approach is different from those taken 
by several other genome mapping groups. One 
approach, exemplified by the ACEDB database of the
C. elegans genome, is the “laboratory notebook”
approach [Sulston et al., 1992], in which the entire 
user interface is incorporated into one custom piece 
of software. An advantage of this system is that the 
interface is carefully crafted to fit the application. A 
disadvantage is that it is difficult to adapt the 
interface to meet changing laboratory needs. A more 
tool-oriented approach has been taken by the 
Chromosome 11 project, in which a relational 
database is used to integrate and control the activities 
of a number of data analysis and manipulation tools 
[Clark et al. 1994]. However, in this case the decision
was made to use an Apple Macintosh-based relational
DBMS in order to take advantage of that system’s 
graphical user interface, and this design decision has 
restricted the possibilities for automating information 
flow since the Macintosh OS does not provide the 
level of inter-process communication offered by 
UNIX. An approach similar to ours, but for a system 
to support a large-scale expressed sequence project, 
has been described by Kerlavage [Kerlavage et al. 
1993]. They also build a pipeline of UNIX-based data 
manipulation tools drawing on a centralized database 
and interacting with scientists through Macintosh 
front ends. The major difference between their 
approach and ours is that their Macintosh interfaces 
are built on a single environment, Hypercard (Apple 
Computer), whereas we use different applications to 
present differing user interface paradigms. 


Although our system was designed to meet 
the needs of a particular genome laboratory, our 
approach may be applicable in other situations in 
which it is necessary to integrate UNIX and PC 
environments.


Abstract

This paper describes lq-text, an inverted index text retrieval
package written by the author. Inverted index
text retrieval provides a fast and effective way of
searching large amounts of text. This is implemented
by making an index to all of the natural-language
words that occur in the text. The actual text remains
unaltered in place, or, if desired, can be compressed
or archived; the index allows rapid searching even if
the data files have been altogether removed.

The design and implementation of lq-text are
discussed, and performance measurements are given
for comparison with other text searching programs
such as grep and agrep. The functionality provided is
compared briefly with other packages such as glimpse
and zbrowser.

The lq-text package is available in source form,
has been successfully integrated into a number of
other systems and products, and is in use at over 100
sites.


1. Introduction 

The main reason for developing lq-text was to provide
an inexpensive (or free) package that could index and 
search fairly large corpora, and that would integrate 
well with other Unix tools. 

Low Cost Solution: There are already a number of 
commercial text retrieval packages available for the 
Unix operating system. However, the prices for these 
packages range from Cdn$30,000 to well over 
$150,000. In addition, the packages are not always 
available on any given platform. A few packages 
were freely available for Unix at the time the project 
was started, but generally had severe limitations, as 
mentioned below. 

A tool for searching large corpora: Some of the
freely available tools used grep to search the data.
While lq-text is O(n) on the number of matches, irrespective
of the size of the data, grep is O(n) on the
size of the data, irrespective of the number of
matches. This linear scan limits the maximum speed of
a grep-based system, and searching for an unusual term
in a large database of several gigabytes would be
infeasible with
grep. Other packages had limits such as 32767 files
indexed or 65535 bytes per file. Tools available on, or
ported from, MS-DOS were particularly likely to suffer
from this malaise. 
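
The difference in cost can be seen in a toy Python sketch, given here only to illustrate the complexity argument. The in-memory dictionary is an assumption made for brevity and says nothing about how lq-text stores its index on disk.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[A-Za-z0-9]+")


def build_index(docs):
    """Map each lower-cased word to a list of (document name, word number)."""
    index = defaultdict(list)
    for name, text in docs.items():
        for number, match in enumerate(WORD.finditer(text)):
            index[match.group().lower()].append((name, number))
    return index


def index_search(index, word):
    """One lookup; the work done is proportional to the matches returned."""
    return index.get(word.lower(), [])


def scan_search(docs, word):
    """grep-style search: every word of every document is examined."""
    hits = []
    for name, text in docs.items():
        for number, match in enumerate(WORD.finditer(text)):
            if match.group().lower() == word.lower():
                hits.append((name, number))
    return hits
```

Both functions return the same answers; the index pays its cost once, when documents are added, while the scan pays it again on every query.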

Unix-based: Unix has always been a productive text 
processing environment, and one would want to be 
able to use any new text processing tools in combina- 
tion with other tools in that environment. 


1.1 Results and Benefits 

Timings are given in detail below, along with a perspective
on how the above goals were met. As a brief
example, a search over the SunOS 4.1.1 manual pages
for “\<core dump\>” on a Sun 4/110 took 52 seconds
with grep, and 1-3 seconds with lq-text. In addition,
lq-text automatically finds matches of a phrase even
if there is a newline or punctuation in the text between
two words. It is also possible to combine lq-text
searches, finding documents containing all of a
number of phrases. Complex queries can be built up
incrementally.


2. Design Goals 

The main goals of any text retrieval package were
outlined briefly in the introduction. lq-text was developed
with some more specific goals in mind, described
in the following sections.


2.1 Limited Memory 

Developed originally on a 4MByte 386 under 386/ix,
lq-text does not assume that any of its index files will
fit completely into memory. As a result, indexing 
performance does not degrade significantly if the 
data does not fit into main, or even virtual, memory. 


2.2 Offline Storage

Once files are indexed, lq-text does not need to consult
them again during the searching process. The
result of a query is a list of matching files, together
with locations within those files. The original text
files are needed if the user wants to view the matched 
documents, but it turns out that the file names and 
document titles are often sufficient. 


2.3 Matching With Certainty 

Packages that use hashing or probabilistic techniques
often return results that might match the user’s query. 
A ‘bad drop scan’ is then used to reject the false hits 
[Lesk78]. These techniques are incompatible with the 
Offline Storage requirement, since a bad drop scan 
may not be possible. 


2.4 Accurate Phrase Matching 

The package should be able to match a phrase word 
for word, including getting the words in the right or- 
der and coping with uneven whitespace. It should 
also be able to match capitalisation and punctuation 
at least approximately, so that the user can constrain a 
search on someone’s name, Brown for example, to 
match Brown but not brown in the text. Only the first 
letter of each word is inspected for capitalisation, in 
order to minimise the data stored in the index. This is 
sufficient for most queries. 

In the literature, the terms recall and precision
are used to refer to the proportion of relevant matches
retrieved and the proportion of retrieved matches that
are relevant, respectively; the goal of lq-text is to have
very high precision, and to give the user some control
over the recall. These terms are defined more precisely
in the literature [Clev66], [Salt88], and are not
discussed further in this paper. The term accuracy is
used loosely in this paper to refer to literal exactness
of a match: for example, whether Ankle matches
ankles as well as ankle. In an inaccurate system,
Ankle might also match a totally dissimilar word such
as yellow. The information that lq-text stores in the
database enables it to achieve a high level of accuracy
in searching for phrases.
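
To make the phrase-matching goal concrete, the following Python sketch keeps, for every occurrence of a word, its word number within the document and a single flag recording whether its first letter was upper case. It is a toy model under stated assumptions, not the lq-text data structure; in particular, the rule that a lower-case query word matches either capitalisation is an assumption.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[A-Za-z0-9]+")


def build_index(docs):
    """word -> {(document, word number): first letter was upper case}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for number, match in enumerate(WORD.finditer(text)):
            word = match.group()
            index[word.lower()][(name, number)] = word[0].isupper()
    return index


def find_phrase(index, phrase):
    """Return sorted (document, word number) positions where the phrase starts."""
    words = WORD.findall(phrase)
    if not words:
        return []
    hits = []
    for doc, start in index.get(words[0].lower(), {}):
        for offset, word in enumerate(words):
            flag = index.get(word.lower(), {}).get((doc, start + offset))
            if flag is None:
                break  # the word does not occur at this position
            if word[0].isupper() and not flag:
                break  # the query asked for a capitalised word
        else:
            hits.append((doc, start))
    return sorted(hits)
```

Because only word numbers are compared, whitespace, newlines and punctuation between the words of a phrase do not affect the match, and the original documents are never consulted.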


2.5 Updatable Index 

It should be possible to add new documents to the in- 
dex at any time. It should also be possible to unindex 
documents, or to update documents in place, as well 
as to inform the package when documents are 
renamed or moved. 


2.6 Unix toolkit approach 

It should be possible to manipulate search results 
with standard Unix tools such as awk and sed. This 
must be done in a way consistent with the Offline 
Storage requirement. For example, the user should be 
able to see the first seven matches using head -7 or 
sed 7q without having to search the documents them- 
selves. 


2.7 Summary 

This paper shows how some of the goals have been 
met, and indicates where work is still in progress on 
other goals. All of the original design goals are still 
felt to be relevant. With continued improvements in 
price/performance ratios of disks, the offline storage 
goal may become less important, but the design philosophy
is still very important for CD-ROM work.


3. Technology Overview 
This section gives a brief background to some of the 
main approaches to text retrieval. 


3.1 Signatures 

Packages based on signatures keep a hash of each 
document or block of each document. The idea is to 
reduce I/O by identifying those blocks which might 
contain a given word. This method does not store 
enough information to match phrases precisely, and 
software relying on it needs to scan documents to 
eliminate bad drops [Salt88], [Falo85], [Falo87a], 
[Falo87b]. A widely distributed publicly available
system, zbrowser, uses a cross between block signatures
and the Document Vector method discussed below.
This system can answer proximity-style queries
(these two words in either order, near each other)
fairly well, but does not handle searching for phrases
[Zimm91].
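
A very small Python sketch of the signature idea, including the bad drop scan, is given below. The 64-bit signature width and the use of CRC-32 as the hash are arbitrary choices for the example and are not taken from any of the cited systems.

```python
import re
import zlib

WORD = re.compile(r"[A-Za-z0-9]+")
BITS = 64  # signature width; deliberately small, so false hits are likely


def word_bit(word):
    """Hash a word to a single bit position in the signature."""
    return 1 << (zlib.crc32(word.lower().encode("utf-8")) % BITS)


def signature(block):
    """OR together one bit per word in the block."""
    sig = 0
    for word in WORD.findall(block):
        sig |= word_bit(word)
    return sig


def search(blocks, signatures, word):
    """Return (candidates, confirmed) block numbers for a single word."""
    bit = word_bit(word)
    candidates = [i for i, sig in enumerate(signatures) if sig & bit]
    # Bad drop scan: re-read each candidate block to discard false hits.
    confirmed = [
        i for i in candidates
        if word.lower() in (w.lower() for w in WORD.findall(blocks[i]))
    ]
    return candidates, confirmed
```

The second step needs the text of the blocks, which is why this family of methods conflicts with the Offline Storage goal.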


3.2 Full Text Inverted Index 

A full text inverted index consists of a record of every 
occurrence of every word, and hence is generally the 
largest in size of the indexes discussed; the index also 
allows the highest accuracy (but not necessarily 
highest precision, see Future Work below). The 
larger index increases I/O needed for searching, but 
on the other hand there is no need to scan documents 
for bad drops [Salt88], [Mead92]. 


3.3 Document Vector 

This strategy keeps a record of every file in which 
each word appears; one could call it a partial inverted 
index. This is usually much smaller than a full text 
inverted index, but cannot be used to find phrases di- 
rectly without a bad drop scan. A recent example is 
Glimpse; this and other examples are mentioned in 
[Salt88], [Mead92], and [Orac92]. Glimpse also appears 
to be restricted to searching a single file system at a 
time on a local machine. 


3.4 Relational Tables 

One way of implementing a full or partial inverted in- 
dex is by storing each occurrence of each word in a 
cell of a relational table. This is generally by far the 
least efficient of the strategies discussed, with in- 
dexes typically three times the size of the data, but it 
is also the easiest to implement robustly. 


3.5 DFA or Patricia Tree 

These systems store a data structure representing a 
deterministic finite state automaton that, when exe- 
cuted against the query, will reach an ‘accept’ state 
representing all matches. They are usually byte 
rather than word oriented, although they can be writ- 
ten either way. It is difficult to allow updates to such 
an index, and the algorithms are fairly complex. 
Knuth describes Patricia trees in some detail [Knut81]; 
a sample in-memory implementation, Cecilia, was 
described by Tsuchiya [Tsuc91]; PAT, the Oxford En- 
glish Dictionary (OED) software, also uses Patricia 
trees [Bray89], [Fawc89]. 


3.6 Other 

Many other approaches are possible. There are sev- 
eral schemes that actually replace the original data 
with, for example, a multiply-threaded linked list, so 
that the data can be recreated from the index. This 
has an unfortunate and well-known failure mode, in 
which the reconstituted text uses incorrect words. 
Other schemes include sub-word indexing, either on 
individual bytes or on n-grams, although these 
usually fall into the ‘DFA’ category above in terms of 
their characteristics. 


4. The lq-text Design

A full text inverted index was chosen to meet the de- 
sign goals. In particular, this is the only strategy 
which allows accurate matching of phrases without 
reverting to a bad drop scan. 

In order to make the index smaller, however, the 
list of matches for each word is compressed, as de- 
scribed in detail in the Implementation section below. 

The package is implemented as a C API in a 
number of separate libraries, which are in turn used 
by a number of separate client programs. The pro- 
grams are typically combined in a pipeline, much in 
the manner of the probabilistic inverted index used by 
refer and hunt [Lesk78]. 

The lq-text package includes a set of input filters
for reading documents into a canonical form suitable
for the indexing program, lqaddfile, to process; a
set of search programs; and programs that take search 
results and deliver the corresponding text. There are 
also wrappers so that users don’t have to remember 
all the individual programs. 


5. The lq-text Implementation

This section describes the implementation of lq-text.


5.1 Information Stored 

Information is stored about the documents indexed,
and about each distinct word-form (for example, information
about occurrences of sock and socks is all
stored under sock; see the discussion of stemming
under Indexing below).

Three distinct kinds of index are used. The first
uses ndbm, a dynamic hashing package [Knut81]. The
second type of index is a fixed record size random-access
file, read using lseek(2) and a cache. Finally, the
third index contains linked lists of variable-sized
blocks; the location of the head of each chain is
stored in one of the other indexes, as detailed below.
This third index is used to store the variable-sized
data constituting the list of matches for each word-form.
The combination of the three index schemes allows
fast access and helps to minimise space overhead.

Dynamic Hashing Packages: These have a number of
desirable properties, including minimal disk access,
since usually only two disk accesses are needed to retrieve
any item, and automatic expansion, since the
hash function simply gets wider as the database
grows, allowing updates at any time. In addition, the
technology is widely available, since many Unix systems
include ndbm, and there are also much faster
ndbm-clone implementations available.

Two ndbm-clone packages are distributed with
lq-text. One of these, sdbm, has been widely distributed
and is very portable [Yigi89]. The other, db, is
part of the 4.4BSD work at Berkeley, and is described
in [Selt91]. A general discussion on implementing
such packages was distributed in [Tore87]. See also
ndbm(3).

Two ndbm databases are used: one maps file
names into File Identifiers (FID), and the other maps
natural-language words into Word Identifiers (WID).

Fixed Record Indexes: A word identifier (WID) as obtained
from the ndbm word map is taken as a record
number into a fixed size record file, widindex. The
information found in this record is described later; for
now, it suffices that one of the fields is a pointer into
the linked lists of blocks in the final file, data.
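
The fixed-record scheme can be illustrated with a short sketch. This is not the lq-text source: the record size and the function names (open_record_file, write_record, read_record) are assumptions made for the example. It shows only the idea that a WID, multiplied by the record size, gives the lseek(2) offset of that word's record.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define RECORD_SIZE 32  /* hypothetical; the real record size is not stated here */

/* Open (creating if necessary) a fixed-record index file. */
int open_record_file(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0644);
}

/* Store the record for word identifier wid; 0 on success, -1 on error. */
int write_record(int fd, unsigned long wid, const unsigned char *rec)
{
    assert(rec != NULL);
    /* The WID is used directly as a record number. */
    if (lseek(fd, (off_t)wid * RECORD_SIZE, SEEK_SET) == (off_t)-1) {
        return -1;
    }
    return (write(fd, rec, RECORD_SIZE) == RECORD_SIZE) ? 0 : -1;
}

/* Fetch the record for wid into rec; 0 on success, -1 on error or
 * if the file does not (yet) contain a complete record for wid. */
int read_record(int fd, unsigned long wid, unsigned char *rec)
{
    assert(rec != NULL);
    if (lseek(fd, (off_t)wid * RECORD_SIZE, SEEK_SET) == (off_t)-1) {
        return -1;
    }
    return (read(fd, rec, RECORD_SIZE) == RECORD_SIZE) ? 0 : -1;
}
```

Because every record has the same size, no search is needed once the WID is known: one seek and one read retrieve the record.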

Linked Lists of Blocks: Each word can occur many
times in any number of files. Hence, a variable-sized
data structure is needed. A linked list of 64-byte disk
blocks is used. However, where adjacent blocks in
the data correspond to the same thread, they are coalesced,
rather as in the Berkeley Fast File System
[McKu83]. Although an lseek(2) and a read(2) may
be required for each 64-byte block, the Unix block
buffer cache makes this arrangement relatively efficient.
A Least Recently Used (LRU) cache holds a
number of 16 Kbyte segments of these 64-byte
blocks, giving a significant speedup, especially over
NFS, where write(2) is synchronous.


5.2 Per-file Information 
Each document is assigned a File Identifier when it is
indexed. A File Identifier, or FID, is simply a number.
Conceptually, this number is then used to store and
retrieve the information shown in Table 1.





Field         Example
location      /home/hermatech/cabochon/letters
name          tammuz.ar(june-14.Z)
title         Star Eyes watched the jellycrusts peel
size          1352 bytes
last indexed  12th Dec. 1991

Table 1: Per-file Information





In the table, /home/hermatech/cabochon/letters is an
absolute path to a directory; if a relative or unrooted
path is given, lq-text will search along a document
path specified in the database configuration file, or
by the DOCPATH environment variable. The search is
performed on retrieval, so that the prefix to the path
needn’t be stored in the database. The document
name is here given separately to emphasise that once
a document has been indexed, it can be moved, or
even compressed and then stored as a member of an
archive, as here: lq-text will automatically extract
june-14.Z from the ar-format archive tammuz.ar and
run uncompress on the result to retrieve the desired
document.
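
The document-path search can be sketched as follows. The colon as separator, the function name find_in_docpath and the use of access(2) are assumptions for illustration; the paper does not describe how lq-text parses DOCPATH.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Look for `name` in each directory of a colon-separated `docpath`.
 * On success the full path is written to `out` and 0 is returned;
 * otherwise -1 is returned. Absolute names bypass the search. */
int find_in_docpath(const char *docpath, const char *name,
                    char *out, size_t outsize)
{
    const char *p = docpath;

    assert(docpath != NULL && name != NULL && out != NULL);

    if (name[0] == '/') {
        if (strlen(name) >= outsize) {
            return -1;
        }
        strcpy(out, name);
        return (access(out, R_OK) == 0) ? 0 : -1;
    }

    while (*p != '\0') {
        size_t dirlen = strcspn(p, ":");  /* length of this directory */

        if (dirlen > 0 && dirlen + 1 + strlen(name) < outsize) {
            snprintf(out, outsize, "%.*s/%s", (int)dirlen, p, name);
            if (access(out, R_OK) == 0) {
                return 0;                 /* first readable hit wins */
            }
        }
        p += dirlen;
        if (*p == ':') {
            p++;
        }
    }
    return -1;
}
```

Since only the relative name is stored in the index, moving a whole hierarchy of documents needs nothing more than a change to the path that is searched.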

The size and date fields shown in the table are
used to prevent duplicate indexes of the same file, so
that one can run lqaddfile repeatedly on all the files
in a directory and add only the new files to the index
(no attempt is made to detect duplicated files with
differing names). In addition, the file size allows
lq-text to make some optimisations to reduce the size of
the index, such as reserving numerically smaller file
identifiers for large files. The reasoning is that larger
files are likely to contain more, and more varied,
words. Since the file identifier has to be stored along
with each occurrence of each word in the index, and
since (as will be shown) lq-text works more efficiently
with smaller numbers, this can be a significant
improvement.

In fact, all of the above information except the
document title is stored directly in the ndbm index.
Using a multi-way trie would probably save space for
the file locations, but the file index is rarely larger
than 10% of the total index size. The DOCPATH
configuration parameter and Unix environment variable
supported by lq-text allow the file names to be stored
as relative paths, which can save almost as much
space as a trie would, and at the same time allow the
user to move entire hierarchies of files after an index
has been created. The document title is kept in a
separate text file, since users are likely to want to update
titles independently of the main text.


5.3 Per-word Information 
For each unique word (that is, for each lemma), lq- 
text stores the information shown in Table 2. 










Field                Bytes
Word length          1
the Word itself      (wordlength)
Overflow Offset      4
No. of Occurrences   4
Flags                1
(Total)              10 + wordlength

Table 2: Per-word Information








The word length and the word are stored only if the 
WordList feature is not disabled in the database con- 
figuration file. The Overflow Offset is the position in 
the data overflow file (described below) at which the 
data for this word begins. The remainder of the word 
index record is used to store the first few matches for 
this word. In many cases, it turns out that all of the 
matches fit in the word index, and the Overflow Off- 
set is set to zero. 

The flags are intended to be a bitwise ‘and’ of
all the flags in the per-word entries described below,
but this is not currently implemented, and the space is
not reserved. When implemented, this will let lq-text
detect whether all occurrences of a word have a given
property, such as starting with a capital letter. This
information can then be used to recreate the correct
word form in generating reports, and when searching
the vocabulary.


5.4 Per-occurrence Information 
For each occurrence of each lemma (that is, for each 
match), the information shown in Table 3 is stored. 

















Field                  Size in Bytes
FID (file identifier)  4
Block Number           4
Word In Block          1
Flags                  1
Separation             1
(Total)                11

Table 3: Per-occurrence Information





Since, on average, English words are approxi- 
mately four characters long, and allowing one charac- 
ter for a space between words, one might expect the 
index to be approximately double the size of the data 
from the per-match information alone. Many com- 
mercial text retrieval systems do as badly, or worse 
[Litt94]. 

In fact, the information is stored compressed.
The matches are stored sorted first by FID and then 
by block within file, then by word in block. As a re- 
sult, for all but the first match it is only necessary to 
store the difference from the previous FID. Further- 
more, matches for a single FID are grouped together, 
so that it isn’t necessary to repeat the FID each time. 
The information ends up being stored as follows: 


ΔFID (difference from previous File IDentifier)
Number of following matches using this FID
    Block In File
    Word In Block
    Flags (if present)
    Separation (if not 1)

    ΔBlock In File
    Word In Block
    Flags (if different from previous)
    Separation (if not 1)


Storing lower-valued numbers makes the use of a
variable-byte representation particularly attractive.
The representation used in lq-text is that the top bit is
set on all but the last byte in a sequence representing
a number. Another common representation is to mark
the first byte with the number of bytes in the number,
rather like the Class field of an Internet address, but
this means that fewer bits are stored in the first byte,
so that there are many more two-byte numbers.
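
A minimal sketch of such a variable-byte representation follows. The byte order (most significant seven-bit group first) and the names vb_encode and vb_decode are assumptions; only the rule that the top bit is set on all but the last byte comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Encode n into buf using 7 data bits per byte, most significant
 * group first. The top bit is set on every byte except the last.
 * Returns the number of bytes written (buf needs room for 10). */
size_t vb_encode(unsigned long n, unsigned char *buf)
{
    unsigned char groups[10];
    size_t len = 0, i;

    assert(buf != NULL);

    /* Collect 7-bit groups, least significant first. */
    do {
        groups[len++] = (unsigned char)(n & 0x7F);
        n >>= 7;
    } while (n != 0);

    /* Emit them in reverse, flagging all but the final byte. */
    for (i = 0; i < len; i++) {
        unsigned char g = groups[len - 1 - i];
        buf[i] = (i + 1 < len) ? (unsigned char)(g | 0x80) : g;
    }
    return len;
}

/* Decode one number from buf; *used receives the bytes consumed. */
unsigned long vb_decode(const unsigned char *buf, size_t *used)
{
    unsigned long n = 0;
    size_t i = 0;

    assert(buf != NULL && used != NULL);

    for (;;) {
        unsigned char byte = buf[i++];
        n = (n << 7) | (unsigned long)(byte & 0x7F);
        if ((byte & 0x80) == 0) {
            break;              /* top bit clear: last byte */
        }
    }
    *used = i;
    return n;
}
```

With this scheme every value below 128 occupies a single byte, which is why small differences between successive values are so much cheaper to store than absolute values.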

The flags are stored only if they are different
from the previous match’s flags. This is indicated on
the disk by setting the least significant bit in the Word
In Block value; this bit is automatically bit-shifted
away on retrieval. Further, the separation is only
stored if the flags indicate that the value is greater
than one (whether or not the flags were stored explicitly).
The combination of delta-coding, variable-byte
representation and optional fields reduces the
size of the average match stored on disk from eleven
to approximately three bytes. For large databases, the
index size is about half the size of the data. Furthermore,
since lq-text has enough information stored that
it can match phrases accurately without looking at the
original documents, it is reasonable to compress and
archive (in the sense of ar(1)) the original files.
When presenting results to the user, lq-text will fetch
the files from such archives automatically.
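
The optional fields can be sketched for the matches within a single file. The structure, the function name pack_matches and the value of the separation flag bit are hypothetical; the sketch writes plain integers, each of which would then be stored in the variable-byte form described earlier.

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_HAS_SEPARATION 0x01u  /* hypothetical bit: separation > 1 */

struct match {
    unsigned block;       /* block within file */
    unsigned word;        /* word within block */
    unsigned flags;
    unsigned separation;  /* meaningful only if FLAG_HAS_SEPARATION */
};

/* Pack n matches from ONE file, sorted by block, into out[].
 * Per match: block delta, (word << 1 | flags-follow bit),
 * then flags and separation only when needed.
 * out[] needs room for 4 * n values. Returns values written. */
size_t pack_matches(const struct match *m, size_t n, unsigned *out)
{
    size_t i, k = 0;
    unsigned prev_block = 0, prev_flags = 0;

    for (i = 0; i < n; i++) {
        unsigned flags_follow = (m[i].flags != prev_flags) ? 1u : 0u;

        assert(m[i].block >= prev_block);     /* input must be sorted */
        out[k++] = m[i].block - prev_block;   /* delta, usually small */
        out[k++] = (m[i].word << 1) | flags_follow;
        if (flags_follow) {
            out[k++] = m[i].flags;
        }
        if (m[i].flags & FLAG_HAS_SEPARATION) {
            out[k++] = m[i].separation;
        }
        prev_block = m[i].block;
        prev_flags = m[i].flags;
    }
    return k;
}
```

In the common case of an unremarkable word following its predecessor directly, a match costs just two small numbers, which is consistent with the average of about three bytes reported above.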


6. Programs and Algorithms 

This section describes the programs used first to create
and update lq-text indexes, and then to retrieve
data, and outlines the main algorithms used.

6.1 Indexing

Storing the Per-File information: Each input file is
read a word at a time. A word is defined to be a WordStart
character followed by zero or more WithinWord
characters. In addition, a character of type
OnlyWithinWord may occur any number of times within a
word, but if it occurs two or more times in a row, the
first occurrence is taken as the last character in the
word. This allows for the apostrophe in possessives
(the James’) as well as in such words as can’t. Words
shorter than MinWordLength are rejected, and the next
word read successfully will have the WPF_LASTHADLETTERS
flag set. Words longer than MaxWordLength are
truncated to that length. In addition, if any punctuation
was skipped in looking for the start of the word,
the WPF_UPPERCASE flag is set on this word.
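
The word-reading rules can be sketched as below. The character classes chosen here (letters start a word, letters and digits continue it, the ASCII apostrophe is the only OnlyWithinWord character), the limit of 20 and the name next_word are assumptions for the example; minimum-length rejection and flag setting are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_WORD_LENGTH 20  /* hypothetical value for MaxWordLength */

static int is_word_start(int c)       { return isalpha(c) != 0; }
static int is_within_word(int c)      { return isalnum(c) != 0; }
static int is_only_within_word(int c) { return c == '\''; }

/* Read the next word at *cursor into word[] (MAX_WORD_LENGTH + 1
 * bytes), advance *cursor past it, and return its stored length.
 * Returns 0 when no further word is found. */
size_t next_word(const char **cursor, char *word)
{
    const unsigned char *p;
    size_t len = 0;

    assert(cursor != NULL && *cursor != NULL && word != NULL);
    p = (const unsigned char *)*cursor;

    while (*p != '\0' && !is_word_start(*p)) {
        p++;                          /* skip space and punctuation */
    }
    while (*p != '\0') {
        int ends_word = 0;

        if (is_only_within_word(*p)) {
            /* Two in a row: the first one ends the word. */
            ends_word = is_only_within_word(p[1]);
        } else if (!is_within_word(*p)) {
            break;
        }
        if (len < MAX_WORD_LENGTH) {
            word[len++] = (char)*p;   /* over-long words are truncated */
        }
        p++;
        if (ends_word) {
            break;
        }
    }
    word[len] = '\0';
    *cursor = (const char *)p;
    return len;
}
```

Called repeatedly on the text "can't stop; the James' dog", this sketch yields can't, stop, the, James' and dog, leaving the trailing apostrophe of the possessive for the stemmer to remove.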

The word is then looked up to see if it was in
the common words file. If so, it is rejected and the
next successfully read word will have the
WPF_LASTWASCOMMON flag set. In addition, whenever the start of
this word is more than one character beyond the end
of the previous successfully read word (as in the
case that a common word, or extra space or punctuation,
was skipped), the next successfully read word
will have the WPF_HASSTUFFBEFORE flag set, and the
distance will be stored in the Separation byte shown in
Table 3.

Common word look-up uses linear search in
one of two sorted lists, depending on the first letter of
the word. Using two lists doubled the speed of the
code with very little change, but if more than 50 or so
words are used, the common word search becomes a
significant overhead; a future version of lq-text will
address this.
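
A sketch of this kind of look-up is given below. The split point between the two lists (first letters up to m, and the rest), the sample words and the name is_common_word are assumptions; the words are assumed to be lower case already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two sorted, NULL-terminated lists; sample contents only. */
static const char *const common_a_to_m[] = {
    "a", "and", "in", "is", "it", NULL
};
static const char *const common_n_to_z[] = {
    "of", "that", "the", "to", NULL
};

/* Return 1 if word is a common word, 0 otherwise. */
int is_common_word(const char *word)
{
    const char *const *list;

    assert(word != NULL);

    /* Choose a list from the first letter, halving the search. */
    list = (word[0] <= 'm') ? common_a_to_m : common_n_to_z;

    for (; *list != NULL; list++) {
        int cmp = strcmp(word, *list);

        if (cmp == 0) {
            return 1;
        }
        if (cmp < 0) {
            return 0;  /* sorted list: the word cannot appear later */
        }
    }
    return 0;
}
```

The search remains linear in the length of the chosen list, which matches the observation that the cost becomes noticeable once the list grows beyond a few dozen words.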

After being accepted, the word is passed to the 
stemmer. Currently, the default compiled-in stemmer 
attempts to detect possessive forms, and to reduce 
plurals to the singular. For example, feet is stored as a 
match for foot, rather than in its own entry; there will 
be no entry for feet. When the stemmer decides a 
word was possessive, it removes the trailing apostro- 
phe or ’s, and sets the WPF_POSSESSIVE flag. When a 
plural is reduced to its singular, the WPF_WASPLURAL flag 
is set. 
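
The possessive half of this behaviour can be sketched as follows. The flag name is taken from the text, but its numeric value and the function name strip_possessive are assumptions; plural reduction (feet to foot) needs a table of irregular forms and is not attempted here.

```c
#include <assert.h>
#include <string.h>

#define WPF_POSSESSIVE 0x01u  /* name from the text; value is hypothetical */

/* Remove a trailing apostrophe or 's from word, in place.
 * Returns WPF_POSSESSIVE if something was removed, else 0. */
unsigned strip_possessive(char *word)
{
    size_t len;

    assert(word != NULL);
    len = strlen(word);

    if (len >= 2 && word[len - 1] == '\'') {
        word[len - 1] = '\0';             /* James' -> James */
        return WPF_POSSESSIVE;
    }
    if (len >= 3 && word[len - 2] == '\'' && word[len - 1] == 's') {
        word[len - 2] = '\0';             /* dog's -> dog */
        return WPF_POSSESSIVE;
    }
    return 0;                             /* can't is left alone */
}
```

Recording the flag rather than discarding the information is what later allows the original word form to be matched or regenerated.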

Other stemming strategies are possible. The
most widespread is Porter’s Algorithm, and this is
discussed with examples in the references [Salt88],
[Frak92]. Such an alternate stemmer can be specified
in the configuration file. Porter’s algorithm will
conflate more terms, so that there are many fewer
separate index entries. Since, however, the algorithm does
not attempt etymological analysis, these conflations
are often surprising.

The converted words are then mapped to a WID
as described above, and stored in an in-memory symbol
table. Whenever the symbol table is full, and also
at the end of an index run, the pending entries are added
to the index, appending to the linked-list chains
in the data file. The ability to add to the data for an
individual word at any time means that an lq-text index
can be added to at any time.

Compression: As mentioned above, numbers written 
to the index are stored in a variable-byte representa- 
tion. In addition, the numbers stored are the differ- 
ence between the current and previous values in a se- 
quence. 

Figure 1: Vocabulary Distribution


Figure 1 shows the vocabulary distribution graph for 
the combined index to the King James Bible, the New 
International Version, and the complete works of 
Shakespeare, a total of some 15 megabytes of text 
(the NIV is in modern English; the other two are in 
16th-century English). It can be seen from the graph 
that a few words account for almost all of the data, 
and almost all words occur fewer than ten times. The 
frequency f of the nth most frequent word is usually 
given by Zipf’s Law: 


    f = k (n + m)^{-s}

where k, m and s are nearly constant for a given
collection of documents [Zipf49], [Mand53]. As a result,
the optimisation whereby lq-text packs the first half
dozen or so matches into the end of the fixed-size
record for that word, filling the space reserved for
storing long words, is a significant saving. On the
other hand, the delta encoding gives spectacular savings
for those few very frequent words: ‘the’ occurs
over 50,000 times in the SunOS manual pages, for example;
the delta coding and the compressed numbers
reduce the storage requirements from 11 to just over
three bytes per match, a saving of almost 400,000 bytes
in the index. Although Zipf’s Law is widely quoted in the
literature, the author is not aware of any text retrieval
packages that are described as optimising for it in this
way.


6.2 Retrieval 

This section describes the various programs used to 
retrieve information from the index, and some of the 
algorithms and data structures used. 

Simple Information: The lqfile program can list
information about files in the index. More usefully,
lqword can list information about words. For example,
the script

    lqword -A | awk '{print $6}' | sort -n

was sufficient to generate data for a grap(1) plot of
word frequencies shown above (see Figure 1).

It is also possible to pipe the output of lqword
into the retrieval programs discussed below in order
to see every occurrence of a given word.

Processing a Query: A query is currently a string
containing a single phrase. The database is searched
for all occurrences of that phrase. In order to process
a query, a client program first calls LQT_StringToPhrase()
to parse the query. This routine uses the
same mechanisms to read words from the string as the
indexer (lqaddfile), and the same stemming is performed.

The client then calls LQT_MakeMatches(), which
uses the data structure to return a set of matches. A
better than linear time algorithm, no worse than
O(total number of matches), is used; this is outlined
in Algorithm 1, and the data structure used is
illustrated in Figure 2.

This appears O(m^w) at first sight, if there are w
words in the phrase each with m matches, but the
search at [3] resumes at the ‘current’ pointer, the
high-water mark reached for the previous word at [1].
As a result, each match is inspected no more than
once. For a phrase of three or more words, the
matches for the last word of the phrase are inspected
only if there is at least one match for the two-word
prefix of the phrase. As a consequence, the algorithm
performs in better than linear time with respect to the
total number of matches of all of the words in the phrase.
In addition, although the data structure in the figure is
shown with all of the information for each word already
read from disk, the algorithm is actually somewhat
lazy, and fetches only on demand.

Sample Program: The lqphrase program takes its
arguments one at a time, treats each as a query, and
processes it in turn as described. The main part of the
source for lqphrase is shown in the Appendix, to
illustrate this part of the C API.

[1] For each match of the first word in the phrase {
[2]     for each subsequent word in the phrase {
[3]         is there a word that continues the match rightward? {
                Starting at current pointer, scan forward for a
                    match in the same file;
                Continue, looking for a match in the same
                    block, or in an adjacent block;
                Check that the flags are compatible
                If no match found, go back to [1]
                Else, if we’re looking at the last word {
                    Accept the match
                } else {
                    continue at [2]
                }
            }
        }
    }


Algorithm 1: Phrase matching
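
The role of the ‘current’ pointers in Algorithm 1 can be seen in a simplified sketch. It is not the lq-text implementation: each occurrence is reduced to a file identifier and an absolute word position (instead of block, word in block, flags and separation), adjacent words are assumed to differ in position by exactly one, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct occ { unsigned fid; unsigned pos; };  /* pos: word number in file */

struct word_list {
    const struct occ *occ;  /* sorted by fid, then pos */
    size_t count;
    size_t current;         /* high-water mark; never moves backward */
};

static int occ_less(struct occ a, struct occ b)
{
    return a.fid < b.fid || (a.fid == b.fid && a.pos < b.pos);
}

/* Find every place where words[0..nwords-1] occur consecutively.
 * Stores the first-word occurrence of each match in out[] (up to
 * max_out entries) and returns the total number of matches. */
size_t match_phrase(struct word_list *words, size_t nwords,
                    struct occ *out, size_t max_out)
{
    size_t found = 0, i, k;

    assert(words != NULL && nwords >= 1);
    for (k = 0; k < nwords; k++) {
        words[k].current = 0;
    }

    for (i = 0; i < words[0].count; i++) {            /* step [1] */
        struct occ first = words[0].occ[i];
        int ok = 1;

        for (k = 1; k < nwords && ok; k++) {          /* step [2] */
            struct word_list *w = &words[k];
            struct occ want = first;

            want.pos = first.pos + (unsigned)k;
            /* step [3]: resume at the current pointer */
            while (w->current < w->count &&
                   occ_less(w->occ[w->current], want)) {
                w->current++;
            }
            ok = w->current < w->count &&
                 w->occ[w->current].fid == want.fid &&
                 w->occ[w->current].pos == want.pos;
        }
        if (ok) {
            if (found < max_out) {
                out[found] = first;
            }
            found++;
        }
    }
    return found;
}
```

Because the first word's matches are visited in sorted order, the position wanted in each later list only ever increases, so each list is scanned at most once over the whole query.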


6.3 Ranking of Results 

A separate program, lqrank, combines sets of results
and sorts them. Currently, only boolean ‘and’ and
‘or’ are available. Quorum ranking, where documents
matching all of the phrases given are ranked
above those that match all but n phrases, will be in the
next release.
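
Boolean combination of two result sets can be sketched as a merge of sorted lists of file identifiers. This illustrates the general technique only; the function names are hypothetical and nothing here is taken from the lqrank source. Each input is assumed to be sorted with no duplicates.

```c
#include <assert.h>
#include <stddef.h>

/* ‘and’: FIDs present in both a[] and b[]. out[] needs room for
 * min(na, nb) values. Returns the number of values written. */
size_t fid_and(const unsigned *a, size_t na,
               const unsigned *b, size_t nb, unsigned *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(out != NULL);
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;
        } else if (b[j] < a[i]) {
            j++;
        } else {
            out[k++] = a[i];   /* in both lists */
            i++;
            j++;
        }
    }
    return k;
}

/* ‘or’: FIDs present in either list. out[] needs room for
 * na + nb values. Returns the number of values written. */
size_t fid_or(const unsigned *a, size_t na,
              const unsigned *b, size_t nb, unsigned *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(out != NULL);
    while (i < na || j < nb) {
        if (j >= nb || (i < na && a[i] < b[j])) {
            out[k++] = a[i++];
        } else if (i >= na || b[j] < a[i]) {
            out[k++] = b[j++];
        } else {
            out[k++] = a[i];   /* equal: emit once */
            i++;
            j++;
        }
    }
    return k;
}
```

Since matches are already stored sorted by file identifier, both operations need only a single pass over each input.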

Statistical ranking—for example, where docu- 
ments containing the given phrases many times rank 
more highly than those containing the phrases only 
once—is also planned work. See [Salt88]. Statistical 
ranking and document similarity, where a whole doc- 
ument is taken as a query, should also take document 
length into account, however; this is an active re- 
search area [Harm93]. 

The initial implementation of lqrank used sed
and fgrep. Tom Christiansen improved this to use perl;
the perl version coped with larger result sets (fgrep
has a limit), but was slower.

The current version is written in C. For ease of
use, lqrank can take phrases directly, as well as lists
of matches and files containing lists of matches.

The algorithms in lqrank are not unusual; the
interested reader is referred to the actual code.


6.4 Presentation of Search Results 
The retrieval programs discussed so far (lqword,
lqphrase and lqrank) return an ASCII match format
as follows:

[1] Number of words in query phrase
[2] Block within file
[3] Word number within block
[4] File Number
[5] File Name and Location

(Items [4] and [5] are optional, but at least one must
be present.)

In itself, this is often useful information, but the
programs described below (lqkwic and lqshow) can
return the corresponding parts of the actual documents.

Key Word In Context listings: The lqkwic program
fetches the data from the original documents referred
to by the given matches. It presents the results in the
format used by a permuted, or ‘key word in context’,
index. lqkwic has a built-in little language [Bent88] to
control the formatting of the results, and uses lazy
evaluation to maximise its efficiency. This program
can be used to generate SGML concordances, for example,
or even simply to expand the file name in each
match into an absolute path. See the Appendix for
sample lqkwic output.

Converting to line numbers: For use with editors and
pagers such as ex, nvi, less and more, the lqbyteline
program converts matches to (file, line-number) pairs.
Unfortunately, although all of these editors and
pagers understand a +n file option to open the named
file at the given line number, the option applies only
to the first file opened; subsequent files are opened at
the first line (vi goes so far as to convert subsequent
+n options to -n options, but then tries to edit a file of
that name).

Text In Place: lqshow is a curses-based program to 
show part of a file, with the matched text highlighted. 


6.5 Combined Interfaces 
Two interfaces to lq-text are currently included in the
distribution. These allow a combination of searching
and browsing in a single interactive session.

lqtext: This is a simple curses-based front end.

lq: This is a shell script that combines all of the
lq-text programs with a simple command language. It
requires the System V shell (with shell functions and
the ability to cope with an 815-line shell script!).

The original purpose of lq was to demonstrate
the use of the various lq-text programs, but lq is
widely used in its own right. A brief example of using
lq to generate a keyword in context listing from
the netnews news.answers newsgroup is shown in the
Appendix.


7. Performance 

This section describes some work that was done to
measure and improve the performance of lq-text, and
then gives some actual measurements and timing
comparisons with other systems.

7.1 Profiling

lq-text originally took over 8 hours to index the King
James Bible on a 25 MHz system running 386/ix. Extensive
profiling, and careful tuning of cache algorithms,
improved performance dramatically: the time to index
the Bible has been reduced to under five minutes.

Function Calls: Although most C compilers have a
fairly low function call overhead these days, it’s still
not trivial. Functions called for every character of the
input were folded inline, and those called for every
word were made into macros in many cases. Understanding
why each function was called the number of
times that it was proved a big help both in speeding up
the programs and in debugging them.

At one point, lqaddfile was spending over 40%
of its time in stat(2). It turned out that it was opening
and closing an ndbm database for every word of input,
which was suboptimal.

Now, most of the routines spend more than half 
of the time doing I/O, and no single function accounts 
for more than 10% of the total execution time. 


7.2 Timings 

The performance of lq-text is compared with SunOS
4.1 grep, GNU grep (ggrep), and Udi Manber’s agrep.
The agrep timings reflect only the simplest use of that
program, since the goal was to generate comparable
results. For lq-text, the time to build the index is also
reported. Recall that the index only needs to be built
once.

The following searches were timed; since the 
results for the various forms of grep were always very 
similar for any given set of files, the grep timings are 
only given once for each collection of data. 

0. Not There: something not in the index at all, a
nonsense word; the time was always 0.0 for lq-text, as
reported by time(1), irrespective of the size of the
database. This timing is therefore omitted from the table.

1. Not Found: a phrase made up of words that do occur,
but not in the order given (if ‘gleeful’ and ‘boy’
each occur, but ‘gleeful boy’ does not, ‘gleeful boy’
would be such a search).

2. Common: a phrase that occurs infrequently, but
includes a relatively frequent word.

3. Unusual: a word or phrase that occurs infrequently.
The following corpora were used: 

Man Pages: The on-line manuals from SunOS 4.1, a 
total of twelve megabytes. 

Bibles: The King James and New International 
bibles, and the Moby Complete Works of Shak- 
espeare, a total of over 15 megabytes. 

FAQ: The netnews news.answers newsgroup of ap- 
proximately 1150 articles, totalling over 30 mega- 
bytes. 

Timing Environment: A SPARCstation 10/30 (1 processor,
30 MHz, no cache) with 64 MBytes of memory
was used for the timings. The system was not
equipped with Wide SCSI disks. The timings are
given in real time, using time(1), as this is the most
important in practice. Each timing was performed
several times, and an average taken of all but the first.
This favours the grep algorithms somewhat, since it
reduces the impact of the I/O that they do.

1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA

Table 4: Timings

             Size            Creation Time
Corpus       Data   Index    real     user    sys
Man Pages    12M    6.5M     267.4    124.9   64.5
Bibles       15M    6.6M     351.3    136.6   51.0
FAQ          41M    20.5M    1598.8   851.9   375.8

Table 5: Index Statistics

The lq-text timings do not include the time to
produce the text of the match, for example with
lqkwic. However, running lqkwic added less than one
second of run-time for any except the very large
queries, even when the data files were accessed over
NFS.

The index overhead is approximately 50% of
the size of the original data. This can be controlled to
some extent using stop-words; the news.answers database
used 79 stop-words, reducing the database by
about 2 Megabytes. In addition, single-letter words
were not indexed, although the presence of a single-letter
word was stored in the per-word flags. The
other databases used no stop words, and indexed
words from 1 to 20 characters in length; the differences
are because the FAQ index is accessed by a
number of other users on our site.

Results: As one would expect, lq-text easily out-performs
the grep family of tools. For queries producing
a lot of matches, such as ‘and the’ (1790 occurrences
in the SunOS manual pages), the time taken to print
the matches dominates the run-time of lqphrase.


8. Ongoing and Future Work 
This section describes speculative, planned and ongo- 
ing work. 

8.1 The C API

The lq-text libraries (liblqerror, liblqutil and
liblqtext itself) each provide a clear set of functions
forming an Application Programmer’s Interface (API).
The process of tidying up the API is under way:

Documentation: The API is currently documented
only by function prototypes in header files and by
examples. Clearly this needs to change.

Completeness: The API isn’t complete yet. For example,
in liblqerror there’s an Eopen() function
which works like Unix open(2), except that it provides
error messages and can be made to exit on error.
However, there is no Eclose() function yet.

Consistency: The structure of the API needs to be
clear enough that one would be able to guess which
library contains any given function; this is largely but
not completely true now. Almost all functions have a
prefix, such as LQT_ in LQT_ObtainWriteAccess(),
for functions from liblqtext. A very few
functions don’t do this, and a few others are actually
defined in client programs rather than in the library.
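
The kind of wrapper described under Completeness can be sketched as a matched pair. The names checked_open and checked_close, their argument lists and the CHECKED_FATAL flag are invented for this example; the actual signature of Eopen() is not given in this paper and may well differ.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHECKED_FATAL 0x01  /* hypothetical: exit instead of returning -1 */

/* Like open(2), but reports failures on stderr and can exit. */
int checked_open(const char *path, int flags, int mode, int how)
{
    int fd;

    assert(path != NULL);
    fd = open(path, flags, (mode_t)mode);
    if (fd < 0) {
        fprintf(stderr, "cannot open \"%s\": %s\n", path, strerror(errno));
        if (how & CHECKED_FATAL) {
            exit(EXIT_FAILURE);
        }
    }
    return fd;
}

/* The matching close: same reporting, same exit-on-error option. */
int checked_close(int fd, const char *path, int how)
{
    assert(path != NULL);
    if (close(fd) < 0) {
        fprintf(stderr, "cannot close \"%s\": %s\n", path, strerror(errno));
        if (how & CHECKED_FATAL) {
            exit(EXIT_FAILURE);
        }
        return -1;
    }
    return 0;
}
```

Providing both halves of such a pair is one way to make the location and behaviour of a function easier to guess, which is the concern raised under Consistency.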


Configuration and Testing: Configuration is currently
a case of editing a Makefile and a C header file,
but several people have asked for something like the
GNU auto-configuration package.

An ad hoc test suite is included with the lq-text
distribution, but this needs to be made more formal,
and to be run automatically when the software is
built.


8.2 A User Interface 

lq-text is primarily a text retrieval engine suitable for 
integration into other systems. However, experimen- 
tal user interfaces have proved popular, and it is cer- 
tainly expected that better interfaces will be provided 
in the future. 

X11 interface: An X11 client based on the Fresco 
toolkit is planned, building on the work of Marc 
Chignel [Golo93], Ed Fox et al. [Fox93] and others. 
However, this work is awaiting the distribution of the 
Fresco toolkit with X11R6. 


8.3 Functionality 

In addition to the user interface, there are some spe- 
cific features that are wanted: 

Approximate matching: currently, lq-text can perform
egrep-style matches against the vocabulary in
the index; it would be interesting to extend this to
agrep-style approximate patterns, and to integrate it
into the main query language, so that
“core /*dump.*~/” might match ‘core dumped’, using
approximate matching only for the second word in
the phrase.

Complex queries: It is desired to support queries that
are themselves complex, or that refer to the structure
of documents stored marked up in SGML format
[Stan88], perhaps building on the work of Forbes
Burkowski [Burk92]. Allowing a more complex syntax in
a query has to be done carefully, so that the language
is both straightforward and general. Handling structured
documents also entails an extended query
parser. At the same time, work on Fuzzy Logic [Zade78]
and on limited recognition of anaphoric references is
proceeding. It may also be possible to perform experiments
in clustering, in the manner of some of the recent
work at Xerox [Cutt93].

Performance: Although lq-text is already pretty fast
at both retrieval and indexing, it could certainly be
made faster. Experiments with mmap(2) and with
alternate cache algorithms are ongoing.

Run-time configuration: New parameters will in- 
clude user-defined stemming (perhaps using stem- 
ming algorithms described by W. Frakes in [Frak92]), 
and allowing a partial (document-vector) index. 

10. Conclusions

The lq-text package is freely available, and hence
clearly meets the criterion of low cost given at the
start of the Introduction above. The largest database
indexed by the author at the time of writing (March
1993) occupies about 100 Megabytes, and it remains
to be determined how suitable lq-text is at indexing
and searching larger bodies of text, which was the
second main goal given in the Introduction. The
package does provide fast retrieval, and meets all of
the Design Goals given above except for the abilities
to unindex and update documents. These last two
features are expected in the summer of 1994.

The source for lq-text is available for anonymous
ftp from ftp.cs.toronto.edu in /pub. Updates
are announced on a mailing list,
lq-text-request@sq.com.

12. Appendix 1: Source for lqphrase


PRIVATE void
MatchOnePhrase(Phrase)
    char *Phrase;
{
    t_Phrase *P;

    if (!Phrase || !*Phrase) {
        /* ignore an empty phrase */
        return;
    }

    if ((P = LQT_StringToPhrase(Phrase)) == (t_Phrase *) 0) {
        /* not empty, but contained no plausible words */
        return;
    }

    /* PrintAndAcceptOneMatch() is a function that prints
     * a single match. It is called for each match as soon as it is
     * read from the disk. This means that results start appearing
     * immediately, a huge benefit in a pipeline.
     */
    if (LQT_MakeMatchesWhere(P, PrintAndAcceptOneMatch) <= 0L) {
        /* no matches were found for this phrase */
        return;
    }
}


13. Appendix 2: Sample lq session
This listing shows part of a session using lq, a shell-script that uses lq-text. The faq command invokes lq after setting up
environment variables and options to use the news.answers database.


sqrex!lee; faq 
Using database in /usr/spool/news/faq.db... 
| Type words or phrases to find, one per line, followed by a blank line. 
| Use control-D to quit. Type ? for more information. 
> text retrieval 
> 
Computer Science Technical Report Archive Sites == news/answers/17480 
1:v.edu> comments: research reports on text retrieval and OCR orgcode: ISRI 
Interleaf FAQ -- Frequently Asked Questions for comp.text.interleaf == news/answers/17607 
2:g with hypertext navigation and full-text retrieval. 1.2. What platform 
alt.cd-rom FAQ == news/answers/17643 
3:and Mac. 56. Where can I find a full text retrieval engine for a CDROM Ia 
4:======== 56. Where can I find a full text retrieval engine for a CDROM Ia
5: I am making? Here is a list of Full-Text Retrieval Engines from the CD-PU 
OPEN LOOK GUI FAQ 02/04: Sun OpenWindows DeskSet Questions == news/answers/18078 
6:OOK/XView/mf-fonts FAQs;lq-text unix text retrieval who is my neighbour?
OPEN LOOK GUI FAQ 04/04: List of programs with an OPEN LOOK UI == news/answers/18079 
7: Description: Networked, distributed text-retrieval system. OLIT-based fr 
8:OOK/XView/mf-fonts FAQs;lq-text unix text retrieval who is my neighbour? 
[comp.text.tex] Metafont: All fonts available in .mf format == news/answers/18080 
9:OOK/XView/mf-fonts FAQs;lq-text unix text retrieval who is my neighbour?
OPEN LOOK GUI FAQ 03/04: the XView Toolkit == news/answers/18082 
10:OOK/XView/mf-fonts FAQs;lq-text unix text retrieval who is my neighbour?
Catalog of free database systems == news/answers/18236 
11:------------------ name: Liam Quin’s text retrieval package (lq-text) vers 
12: are bugs. description: lq-text is a text retrieval package. That means 
| Type words or phrases to find, one per line, followed by a blank line. 
| Use control-D to quit. Type ? for more information. 
> :less 9 (brings up the 9th match in a pager) 
> :help 
| Commands
|
| :help
|     Shows you this message.
| :view [n,n1-n2,n...]
|     Use :view to see the text surrounding matches.
|     The number n is from the left-hand edge of the index;
|     :set maxhits explains ranges in more detail.
| :page [n,n1-n2,n...]
|     Uses less -c (which you can set
|     in the $PAGER Unix environment variable) to show the files matched.
| :less [n,n1-n2,n...]
|     The same as :page except that it starts on the line number
|     containing the match, firing up the pager separately on
|     each file.
| :index [n,n1-n2,n...]
|     This shows the index for the phrases you’ve typed.
| :files [n,n1-n2,n...]
|     This simply lists all of the files that were matched,
|     in ranked order.
| :prefix prefix-string
|     Shows all of the words in the database that begin with that prefix,
|     together with the number of times they occur.
| :grep egrep-pattern
|     Shows all of the words in the database that match that egrep pattern,
|     together with the number of times they occur (not the number of files
|     in which they appear, though).
| :set option
|     Type :set to see more information about setting options.
| :shell [command ...]
|     Use /home/sqrex/lee/bin/sun4os4/msh to run commands.
|
| Commands that take a list of matches (n,n1-n2,n...)
| only work when you have generated a list of matches. If you don’t give
| any arguments, you get the whole list.
>  :set 
| :set match [precise| heuristic | rough] 
| to set the phrase matching level, 
| currently heuristic matching (the default) 
| :set index 
| to see a Key Word In Context (KWIC) index of the matches 
| :set text 
| to browse the actual text that was matched 
| :set database directory 
| to use the database in the named directory. 
| Current value: /usr/spool/news/faq.db 
| :set rank [all|most|any] 
| all presents only documents containing all the phrases you typed; 
| most shows those first, and then documents with all but one of 
| the phrases, and so on. 
| any doesn’t bother sorting the matches; this is the default, 
| because it’s the fastest. 
| Currently: find documents containing all of the phrases 
| :set tags string 
| determine whether SGML tags are shown or hidden in the KWIK index 
| :set tagchar string 
| set the character used as a replacement for hidden SGML tags 
| :set maxhits n|all 
| show only the first n matches (currently 200) 
| :set prompt string 
| set the prompt for typing in phrases to string 
> ^C
Interrupted, bye. 

sqrex! lee;
Abstract


In this paper, we demonstrate a technique called ac- 
tive probing used to study TCP implementations. Ac- 
tive probing treats a TCP implementation as a black 
box, and uses a set of procedures to probe the black 
box. By studying the way TCP responds to the probes, 
one can deduce several characteristics of the imple- 
mentation. The technique is particularly useful if TCP 
source code is unavailable. 


To demonstrate the technique, the paper shows ex- 
ample probe procedures that examine three aspects of 
TCP. The results are informative: they reveal imple- 
mentation flaws, protocol violations, and the details of 
design decisions in five vendor-supported TCP imple- 
mentations. The results of our experiment suggest that 
active probing can be used to test TCP implementa- 
tions. 


1 Introduction 


The Transmission Control Protocol (TCP) is 
a connection-oriented, flow-controlled, end-to-end 
transport protocol that provides reliable transfer and 
ordered delivery of data [14]. TCP is designed to op- 
erate successfully over communication paths that are 
inherently unreliable (i.e., they can lose, damage, du- 
plicate, and reorder packets). The ability of TCP to 
adapt to networks of various characteristics and com- 
puter systems of various processing power makes TCP 
an important component in the fast expansion of the 
global Internet. 


The original definition of TCP appears in RFC-793 
[14]. Many researchers [2, 7, 8, 9, 11, 12, 18, 19] have 
identified problems and weaknesses in the protocol, and
proposed solutions. RFC-1122 [1] updates and sup- 
plements the definition; to meet the TCP standard, an 
implementation must follow both RFC-793 and RFC- 
1122. 


*This work was supported in part by a fellowship from UniForum.
Although RFCs 793 and 1122 give a detailed description
of TCP implementation, two TCP implementations
that conform to the specifications can differ
slightly because an implementor has some freedom to
choose a software design and parameters, and to interpret
the protocol standards. Although it is possible to deduce
design decisions and parameter choices from the
source code, understanding the operation of a complex 
software module like TCP can be difficult. In this pa- 
per, we demonstrate a technique called active probing 
used to study TCP implementations. Active probing 
is especially useful when source code is unavailable. 
Furthermore, it shows how the TCP code operates in 
the presence of other system components. 


Active probing treats a TCP implementation as a 
black box and uses a set of procedures to probe the 
black box. By studying the way TCP responds to 
the probes, one can deduce characteristics of the im- 
plementation. The information that can be deduced 
depends on the probing procedures used. In this paper, 
we show three example procedures that examine three 
aspects of TCP. The results are informative: they re- 
veal implementation flaws, protocol violations, and the 
details of design decisions in commercially available 
TCP implementations. The results of the experiment 
suggest that active probing can also be used to test TCP 
implementations. 


Active probing operates much like traditional TCP 
trace analysis. It uses a software tool to capture TCP 
segments directed toward a particular TCP implemen- 
tation as well as segments the TCP implementation 
sends in response. It then analyzes the trace data to 
find patterns that reveal characteristics of the TCP im- 
plementation. Unlike trace analysis, however, active 
probing uses specially designed probing procedures to 
induce TCP traffic instead of passively monitoring nor- 
mal traffic on the network. 


The software tools used to capture TCP segments 
and to assist in the analysis of the trace data are widely 
available, both in the public domain and commercially.
RFC-1470 [4] gives a detailed catalog of such
tools. All experiments reported in this paper use the
tools from NetMetrix [6] to capture the TCP segments
and to assist in the analysis of the trace data; we also
wrote C programs to parse and analyze the captured
data.


The experiments reported in this paper examine 
commercially available TCP implementations: Solaris 
2.1, SunOS 4.1.1, SunOS 4.0.3, HP-UX 9.0, and IRIX 
5.1.1. We chose these implementations because they 
are widely available in workstation operating systems. 
We have access only to the source code of SunOS
4.0.3 and SunOS 4.1.1.


The remainder of this paper is organized as follows. 
Section 2 examines TCP retransmission time-out in- 
tervals for successive retransmission of a single data 
segment. Section 3 studies the keep-alive mechanism 
in some TCP implementations. Section 4 investigates 
TCP zero-window probing. Finally, section 5 draws 
conclusions and discusses future work. 


2 Successive Retransmission Intervals In 
TCP 


TCP uses an acknowledgment and retransmission 
scheme to ensure the reliable delivery of packets. 
When sending a packet, the sender starts a timer and 
expects an acknowledgment from the receiver within 
a retransmission time-out (RTO) period. If the sender 
does not receive an acknowledgment in that period, it 
assumes the packet was lost and retransmits the packet. 
The correct estimation of the retransmission time-out 
is vitally important to provide effective data transmis- 
sion and avoid overwhelming the Internet by excessive 
retransmissions [11]. On one hand, if the sender uses 
a smaller RTO value than the actual packet round-trip 
time (RTT), unnecessary retransmissions occur. More- 
over, if the packet round-trip time increase is due to net- 
work congestion, unnecessary retransmissions make 
the situation even worse and may lead to congestion 
collapse [12]. On the other hand, if the sender uses 
a larger RTO value, a lost packet causes the sender to 
wait longer than necessary, thus degrading throughput. 


The calculation of the RTO value originally sug- 
gested in RFC-793 is now known to be inadequate and 
has been replaced. RFC-1122 specifies the new stan- 
dard, which uses an algorithm described in Jacobson 
[8]. The new algorithm uses the measured RTT val- 
ues to calculate a smoothed mean and a measure of 
the variance using a smoothed mean difference. The 
RTO is then calculated from the smoothed mean and
Figure 1: Configuration of networks and hosts to obtain
successive retransmission intervals in TCP


the variance. RFC-1122 specifies that TCP must im- 
plement this algorithm and must exponentially increase 
the RTO values for successive retransmissions of the 
same segment. 
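
The estimator just described can be sketched in a few lines of Python. This is only an illustration, not code taken from any of the probed systems: the gains of 1/8 and 1/4 are the values suggested by Jacobson [8], the multiplier k = 2 and the initial deviation are assumptions, and the names RtoEstimator, srtt, and rttvar are hypothetical.

```python
class RtoEstimator:
    """Smoothed RTT mean and smoothed mean difference, in the style of Jacobson [8]."""

    def __init__(self, first_rtt, k=2.0):
        self.srtt = first_rtt            # smoothed mean of the RTT samples
        self.rttvar = first_rtt / 2.0    # smoothed mean difference (initial value is an assumption)
        self.k = k                       # multiplier applied to the mean difference (assumption)

    def update(self, rtt):
        """Fold one measured RTT sample into the estimate and return the new RTO."""
        err = rtt - self.srtt
        self.srtt += err / 8.0                          # gain of 1/8 on the mean
        self.rttvar += (abs(err) - self.rttvar) / 4.0   # gain of 1/4 on the mean difference
        return self.rto()

    def rto(self):
        """RTO calculated from the smoothed mean and the mean difference."""
        return self.srtt + self.k * self.rttvar
```

With a steady stream of 10 ms samples the estimate settles close to 10 ms; a single 50 ms sample then raises both the mean and the mean difference, so the RTO grows faster than the mean alone.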


2.1 Probing Procedure 


To determine how a TCP implementation chooses 
RTO values for successive retransmissions, we use the 
following probe procedure: 


1. From a host to be tested, T, select a multi-homed
host¹, H, as the destination (see Figure 1).


2. Let the IP address of one interface on H, say A, be
the destination address that can be reached by T.


3. From T, open a TCP connection to the discard port
[16] of host H via interface A, and start sending
data.


4. Log in to host H from a control host, C, via another
interface, say B.


5. Disable interface A while the communication between
host T and host H is in progress².
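
The sending side of this procedure (step 3) might look like the sketch below. It is a hypothetical probe program, not the one used in the experiments: the function names, the 512-octet chunk, and the connect timeout are all assumptions. Port 9 is the standard TCP discard port.

```python
import socket

DISCARD_PORT = 9  # standard TCP discard service


def connect_discard(host, port=DISCARD_PORT, timeout=10.0):
    """Open a TCP connection from the tested host to the discard port on H."""
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.settimeout(None)  # block in send so that only TCP itself ends the transfer
    return sock


def send_until_failure(sock, chunk=b"x" * 512, max_chunks=None):
    """Keep writing data; return the number of octets handed to TCP
    before the connection fails or max_chunks is reached."""
    sent = 0
    count = 0
    try:
        while max_chunks is None or count < max_chunks:
            sock.sendall(chunk)
            sent += len(chunk)
            count += 1
    except OSError:
        # TCP gave up, for example after exhausting its retransmissions.
        pass
    return sent
```

Running `send_until_failure(connect_discard("H"))` on the tested host, where "H" stands for the address of interface A, and then disabling interface A leaves the sender retransmitting until its TCP drops the connection.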


Disabling interface A while host T is sending data 
to the discard port of host H via interface A simulates 
a network failure between host T and host H, and it
triggers retransmissions from T. Note that T runs the


¹A multi-homed host is a host that connects to more than one
network.

²We used the UNIX command ifconfig to disable the
interface.
Figure 2: The TCP RTO intervals for successive retransmissions in a LAN environment


probe program; it is also the host on which TCP is 
being probed. To enable continued control of H while 
interface A is down, one must log in to H from a control
host (C) via interface B. C and interface B are connected 
to the same Ethernet. 


Because the RTO estimate depends on the packet 
round trip time between a tested host and host H, all the 
TCP implementations tested run on hosts connected to 
10 megabit per second (Mbps) Ethernets. The average 
load on the Ethernets during the experiment is less 
than 10% of capacity. The tested hosts are located at 
most one gateway from H (see Figure 1). The average 
round trip time of packets between a tested host and 
H during the experiments, measured using ping, is at 
most 10 ms. To make the measurements more accurate, 
the monitor program that captures the TCP segments 
always runs on a host connected to the same Ethernet 
as the hosts being probed. (The monitor program runs 
on host M1 or M2, depending on which host is being
probed.) 


2.2 Results 


For each TCP implementation, we conducted 30 ex- 
periments; Figure 2 shows the results. As the graphs 
in Figure 2 show, four of the probed operating sys- 
tems, SunOS 4.1.1, SunOS 4.0.3, HP-UX 9.0, and IRIX 
5.1.1, behave the same. Each increases the RTO values
exponentially on successive retransmissions until it
reaches a maximum RTO of 64 seconds. Each retrans- 
mits the same data segment twelve times; at the thir- 
teenth transmission, each sends a reset (RST) segment 
(without data), drops the connection, and terminates 
the process that executes the probe program. 
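
The behavior observed in these four systems can be modeled with a short sketch. The doubling factor and the 1-second initial RTO in the usage note are assumptions chosen for illustration (they are not measured values); the 64-second ceiling and the twelve retransmissions come from the results above. The function name is hypothetical.

```python
def backoff_intervals(initial_rto, retransmissions=12, max_rto=64.0):
    """Return the interval (in seconds) preceding each successive
    retransmission, doubling each time and clamping at max_rto."""
    intervals = []
    rto = initial_rto
    for _ in range(retransmissions):
        intervals.append(min(rto, max_rto))
        rto *= 2  # exponential increase (a factor of 2 is an assumption)
    return intervals
```

For example, `backoff_intervals(1.0)` yields 1, 2, 4, 8, 16, and 32 seconds followed by six intervals of 64 seconds, for a total of 447 seconds.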


Solaris 2.1 TCP increases the RTO values for succes- 
sive retransmissions and drops the connection after the 
ninth retransmission. The Solaris TCP does not send a 
RST segment after the ninth retransmission. However, 
it delays for 62.2 seconds³ before it drops the connection
and terminates the process that executes the probe
program. 


RFC-1122 specifies a threshold, R2, for dealing with 
excessive retransmissions of the same segment by TCP. 
R2 can be measured in units of time or as a count of re- 


³Obtained by using $\sum_{i=1}^{30} (p_i - q_i)/30$, where $p_i$ is the interval
between the instant at which the probe program calls a connect
routine to establish a connection and the instant at which the process
that runs the probe program (call it P) exits in the $i$-th experiment,
and $q_i$ is the interval between the instant at which TCP sends the
first segment and the instant at which TCP sends the last segment,
as measured by the monitor program in the $i$-th experiment. The
time interval $p_i$ consists of three parts: $\alpha$, $q_i$, and $\beta$, where $\alpha$ is
the interval between the instant at which the probe program calls
connect and the instant at which the first segment is sent, and $\beta$ is
the interval between the instant at which the last segment is sent and
the instant at which process P exits. Because $\alpha$ is small compared
to $\beta$, $p_i - q_i$ is an approximation of $\beta$.
Figure 3: The initial RTO values in TCP implementations in a LAN environment


transmissions. When the number of retransmissions of 
the same segment reaches R2, TCP closes the connec- 
tion. RFC-1122 specifies that R2 should correspond to 
at least 100 seconds. All the implementations probed 
meet the requirement. However, no implementation al- 
lows users to configure the value for R2 as RFC-1122 
mandates. 


The initial RTO values in TCP implementations are 
worth noting. In a local area network (LAN) envi- 
ronment that consists of 10 Mbps Ethernet segments 
with an average load of less than 10% of the available
bandwidth, typical packet RTTs average less than 20 
ms, and the variance (smoothed mean difference) of 
the packet RTTs averages less than 10 ms. So, a typ- 
ical RTO value calculated from mean plus variance 
will remain under 100 ms. Figure 3 shows that the 
initial RTO values used by TCP implementations are 
all much higher than 100 ms. The large initial RTO 
values suggest that the implementations have imposed 
a lower bound on the RTO estimates. 


2.3 The Lower Bound on RTO Estimates 


There are two reasons for imposing a lower bound 
on the RTO estimates. First, the timer used to measure 
packet RTT may be too coarse for accurate measure- 
ments. For example, the 4.3BSD TCP (and most of its 
derivatives) uses a timer of 500 ms per tick to measure 
the packet round trip time and to schedule retransmis- 
sions [10]. In a LAN environment with typical packet 
RTT less than 20 ms, using such a timer to measure 
packet RTT accurately is impossible. Thus, a lower 


bound filters out the RTT samples that are too small to 
measure accurately with a coarse granularity timer. 


Second, imposing a lower bound on RTO estimates 
can improve throughput in a LAN environment. A 
LAN environment exhibits low packet loss and low 
average packet round trip time. Imagine a TCP im- 
plementation that uses a millisecond granularity timer 
to measure packet round trip time and to schedule re- 
transmissions without imposing a lower bound on RTO 
estimates. Under normal load conditions, the smoothed 
RTT will be less than 10 ms and the variance (smoothed 
mean difference) is less than 5 ms. A sudden network 
delay or host processing delay that causes the RTT of a 
segment to exceed 20 ms⁴ will cause a retransmission
of that segment even though the segment is not likely 
to be lost in transit. The redundant retransmission not 
only consumes network bandwidth and adds unneces- 
sary processing overhead to the sender and receiver, 
but also forces the sender to a slow start mode [8] that 
reduces its transmission rate. 
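
The scenario above can be checked numerically. The sketch below is illustrative only: it uses the mean-plus-twice-the-variance rule with a smoothed RTT of 10 ms and a variance of 5 ms, a hypothetical delayed sample of 30 ms, and a 200 ms floor of the kind observed in the probed systems. The function names are assumptions.

```python
def bounded_rto(srtt, rttvar, lower_bound=0.0, k=2.0):
    """RTO as mean plus k times the variance, never below lower_bound."""
    return max(lower_bound, srtt + k * rttvar)


def retransmits_spuriously(rto, actual_rtt):
    """True if the retransmission timer expires before the delayed ACK arrives."""
    return actual_rtt > rto


# Millisecond-granularity timer, no lower bound: the RTO is 20 ms.
unbounded = bounded_rto(0.010, 0.005)
# Same estimates with a 200 ms lower bound.
bounded = bounded_rto(0.010, 0.005, lower_bound=0.200)
```

A segment whose RTT jumps to 30 ms is retransmitted needlessly under the unbounded RTO, but not under the bounded one.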


Another way of viewing the lower bound on the RTO 
estimates is to consider it a threshold for the RTO esti- 
mation algorithm to take effect. If the lower bound is 
set to infinity, TCP ignores the RTO estimates entirely 
(TCP makes no attempt to retransmit lost packets); if 
the lower bound is set to zero, TCP uses the RTO esti- 
mates for each transmission. Because the RTO estima- 
tion algorithm derives an estimate of future RTO from 
the previous RTT samples, it can only cover the fluc- 
tuations of packet RTT within a specific range. Any 


⁴We calculate RTO as mean plus twice the variance.

Segment  Host A                                 Host H
Number   (Solaris 2.1)                                                       Comment
992      S:2473 D:488 A:2002 W:9112  --->
993      S:3985 D:536 A:2002 W:9112  --->                                    ERROR! sequence # should be 2961
994                                  <---  S:2002 D:0 A:2961 W:3608
995      S:4521 D:448 A:2002 W:9112  --->
996      S:5009 D:536 A:2002 W:9112  --->
997      S:5545 D:448 A:2002 W:9112  --->
998      S:6003 D:536 A:2002 W:9112  --->
999                                  <---  S:2002 D:0 A:2961 W:4096
1000     S:2961 D:536 A:2002 W:9112  --->                                    (Transmission of the missing data)
1001                                 <---  S:2002 D:0 A:3497 W:4096
1002     S:3497 D:488 A:2002 W:9112  --->                                    (Transmission of the missing data)
1003                                 <---  S:2002 D:0 A:6569 W:4096

S: Sequence number, D: Number of data octets, A: Acknowledgment number, W: Window
Note: Only the last four digits of the sequence number and acknowledgment number are shown.


Figure 4: Illustration of an implementation flaw in Solaris 2.1 TCP. 


sudden RTT fluctuations that exceed that range will 
trigger unnecessary retransmissions. On one hand, using
a higher lower bound allows TCP to tolerate greater
network delay fluctuations without triggering unneces- 
sary retransmissions; but it makes TCP take longer 
to respond to lost packets. On the other hand, using 
a lower lower bound allows TCP to respond to lost 
packets quickly, but it may cause unnecessary retrans- 
missions when network delay fluctuations exceed RTO 
estimations. Therefore, the lower bound on RTO es- 
timates is a design parameter a TCP implementation 
must choose carefully. 


As Figure 3 shows, the lower bound on each observed
system is a range of values⁵. IRIX 5.1.1 TCP has the
largest lower bound (in the range of 1000 ms to 1500
ms) and Solaris TCP has the smallest lower bound (in
the range of 200 ms to 400 ms). SunOS 4.1.1, HP-UX
9.0, and SunOS 4.0.3 have lower bounds in the
range of 500 ms to 1000 ms.


2.4 Implementation Flaw Found 


In analyzing the probe results for Solaris 2.1 TCP, 
we have found an apparent implementation flaw. The 
symptom occurs in all 30 instances of TCP trace data 
we gathered. As Figure 4 illustrates, host A, running 
Solaris 2.1, sends data to the discard port of host H. 
Segment #992 has sequence number 2473 and car- 
ries 488 octets of data. The next data segment from 
A should have sequence number 2961 (2473+488). 


⁵In reading the SunOS 4.1.1 and 4.0.3 TCP source code, we
found that the inaccuracy in the timer algorithm for scheduling retransmissions
can cause the lower bound on RTO to be a range of
values.
Instead, segment #993 has sequence number 3985
(2473+488+1024). Apparently TCP has skipped 1024 
octets in the sequence space! After 234 milliseconds, 
A transmits the missing 1024 octets of data in segment 
#1000 and segment #1002. 
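
A check of this kind is easy to automate over a captured trace. The sketch below is a hypothetical analysis routine, not the C programs used in the experiments: it takes the outgoing data segments in capture order and reports every place where a sequence number jumps past the next expected octet. The Segment type and function name are assumptions, and sequence-number wraparound is ignored for simplicity.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    number: int   # position of the segment in the trace
    seq: int      # sequence number (last four digits, as in Figure 4)
    length: int   # number of data octets carried


def find_sequence_gaps(segments):
    """Return (segment number, expected seq, actual seq, octets skipped)
    for every segment that starts beyond the next expected octet."""
    gaps = []
    expected = None
    for seg in segments:
        if expected is not None and seg.seq > expected:
            gaps.append((seg.number, expected, seg.seq, seg.seq - expected))
        end = seg.seq + seg.length
        expected = end if expected is None else max(expected, end)
    return gaps


trace = [Segment(992, 2473, 488), Segment(993, 3985, 536)]
gaps = find_sequence_gaps(trace)
```

Applied to segments #992 and #993, the routine reports that segment #993 should have started at 2961 and that 1024 octets were skipped.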


Note that the monitor program runs on host M2 con- 
nected to the same Ethernet as A. Thus, the missing 
segments are not discarded by a gateway. Further- 
more, the retransmissions of the missing data segments 
in segments #1000 and #1002 show that the error did
not result from the monitor program missing the original
transmissions. The same symptom also occurs in
10 of the 30 instances of the IRIX 5.1.1 trace data. 


3 TCP Keep-alives 


The TCP specification does not include a mecha- 
nism for probing idle connections. In theory, if a host 
crashes after establishing a connection to another host, 
the second machine will continue to hold the idle con- 
nection forever. Some TCP implementations include a 
mechanism that tests an idle connection and releases it 
if the remote host has crashed. Called TCP keep-alive, 
the mechanism periodically sends a probe segment to 
elicit a response from the peer. If the peer responds to
the probe by sending an ACK, the connection is alive. 
If the peer TCP fails to respond to probe segments for 
longer than a fixed threshold, the connection is declared 
down and the connection is closed. 


According to RFC-1122, a TCP implementation may
include the keep-alive mechanism. However, if TCP
keep-alive is included, the applications must be able
to turn it on or off on a per-connection basis, and by
Figure 5: The sender’s send and receiver’s receive TCP sequence spaces when a connection is quiet
Table 1: The results of TCP keep-alive probing in TCP
implementations


default, it must be off. The threshold interval to send 
TCP keep-alives must be configurable and must default 
to 7,200 seconds (two hours) or more. Because TCP 
does not reliably transmit ACK segments that carry no 
data⁶, an ACK segment in response to the keep-alive
probe may be lost. Therefore, a TCP should drop the
connection only after a predefined number of keep-alive
probes fail to elicit a response from the peer.


3.1 Probe Procedure 


We use the following probe procedure to study 
whether an implementation of TCP uses keep-alive, 
and, if so, how it implements it.


1. From a host to be tested, open a TCP connection 
to the discard port of another host. 


2. Enable keep-alive on the connection. 


3. Pause⁷ until a terminating signal occurs.
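
The three steps can be sketched as follows. This is a hypothetical probe program, not the one used in the experiments; the function names are assumptions. SO_KEEPALIVE is the standard socket option for enabling keep-alive, port 9 is the standard discard port, and signal.pause() is available on UNIX systems only.

```python
import signal
import socket


def enable_keepalive(sock):
    """Step 2: turn on TCP keep-alive for this connection only."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


def keepalive_enabled(sock):
    """Report whether keep-alive is currently on for the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


def keepalive_probe(host, port=9):
    """Steps 1 to 3: connect to the discard port, enable keep-alive,
    then sleep until a signal arrives (UNIX only)."""
    sock = socket.create_connection((host, port))
    enable_keepalive(sock)
    signal.pause()
```

Because the program neither sends nor receives after connecting, any segments seen by the monitor on the idle connection are keep-alive probes and the responses to them.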


As Figure 5 illustrates, when a TCP connection 
is quiet, the sequence number of the sender’s next 


⁶There is no retransmission timer set for an ACK segment that
carries no data.

⁷The C library function pause() may be used.

⁸No keep-alive segment was observed in five observations; each
observation lasted for 30 hours.


octet to send (SND.NXT) is the same as the se- 
quence number of the receiver’s next octet to receive 
(RCV.NXT), and the size of the sender’s send window 
(SND.WND) is the same as the receiver’s receive win- 
dow size (RCV.WND). RFC-1122 recommends using 
a sequence number (SEG.SEQ) of SND.NXT-1 with 
or without one octet of garbage data as the keep-alive 
segment. Using one octet of garbage data makes the 
keep-alive mechanism compatible with early TCP im- 
plementations that cannot handle a SEG.SEQ equal to 
SND.NXT-1 without one octet of data. Because the 
sequence number SND.NXT-1 lies outside the peer’s 
receive window, it causes the peer TCP to respond with 
an ACK segment if the connection is still alive; if the 
peer has dropped the connection, it will respond with a 
reset (RST) segment instead of an ACK segment [14]. 
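
The reason a sequence number of SND.NXT-1 elicits an ACK can be illustrated with modular sequence arithmetic. The sketch below is a simplification under stated assumptions: it tests only the first octet of a segment against the window, the function names are hypothetical, and the sample values are arbitrary.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def in_receive_window(seg_seq, rcv_nxt, rcv_wnd):
    """True if seg_seq falls inside [RCV.NXT, RCV.NXT + RCV.WND)."""
    return (seg_seq - rcv_nxt) % MOD < rcv_wnd


def keepalive_seq(snd_nxt):
    """Sequence number recommended for a keep-alive segment: SND.NXT-1."""
    return (snd_nxt - 1) % MOD
```

On a quiet connection the sender's SND.NXT equals the receiver's RCV.NXT, so `keepalive_seq(snd_nxt)` always lies one octet to the left of the receive window, and a live peer answers with an ACK.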


3.2 Results 


All the TCP implementations we tested correctly set 
the default so TCP did not send keep-alive probes, and 
let the applications turn on keep-alive on a per-connection
basis. Most implementations use a 7,200-second
(2 hours) time interval between probes, as specified in 
RFC-1122. SunOS 4.0.3 uses a 75-second interval be- 
tween probes. However, none of the implementations 
allow users to configure the probing interval as man- 
dated in RFC-1122. Although Solaris 2.1 provides a 
socket option to turn on the TCP keep-alive, we did 
not observe any keep-alive probes in five observations; 
each observation lasted for 30 hours. 


RFC-1122 does not specify the contents of the ac- 
knowledgment field (SEG.ACK) of the keep-alive seg- 
ment. However, as Table 1 shows, most of the TCP 
implementations set the SEG.ACK to RCV.NXT-1. It 
is unnecessary to set SEG.ACK to RCV.NXT-1 unless 
it is also for backward compatibility with early TCP 
implementations. To see if probed implementations
respond to a keep-alive segment that has SEG.SEQ
equal to SND.NXT-1, SEG.ACK equal to RCV.NXT, 
and does not include one octet of data, we modified 
the SunOS 4.0.3 TCP code to send such a keep-alive 
segment. All implementations responded correctly to 
the keep-alive segment. 


3.3 Keep-alive and Server Applications 


TCP keep-alive is especially useful for a server ap- 
plication to prevent clients from holding server re- 
sources indefinitely after clients crash or after a net- 
work failure. As an example to see how network fail- 
ure can affect a host when a server application does 
not turn on TCP keep-alive and does not deploy mech- 
anisms to handle idle connections, consider the probe 
procedure used in section 2. The probe procedure de- 
liberately disables interface A on host H while a probe 
program on host T is communicating with the TCP
discard server⁹ on host H via interface A. After host
T retransmits a data segment for a preset number of 
times without any response, it closes the connection. 
Unfortunately, the discard server on host H has no idea 
that the peer has aborted the connection because it does 
not turn on the TCP keep-alive and makes no attempt 
to detect the idle connection. From its point of view 
the connection remains quiet. After each experiment, 
there is an orphan discard server process left on host H. 
These orphan server processes stay until the system reboots
or a system manager destroys them explicitly¹⁰.


4 Zero-Window Probes 


TCP in a receiving host uses the window field in each 
acknowledgement to inform TCP in the sending host 
how much more data it is willing to accept [14]. If the 
receiver temporarily runs out of buffer space, it sends 
an ACK with the window field set to zero. When space 
becomes available, the receiver sends another ACK 
with a nonzero window size. Because the ACK that 
reopens the window can be lost in transit, the connection
may hang forever. TCP specifications [1, 14] require a 
host that has received a zero window advertisement to 
transmit zero-window probe segments to the receiving 
host requesting its current buffer space if it does not 
receive a nonzero window advertisement in a specified 
period of time. The sender must increase the intervals 
between the zero-window probes exponentially as it 
does for retransmissions. 


⁹The program inetd implements the discard server.

¹⁰To prevent too many orphan discard server processes from affecting
the experiment, we destroyed the orphan process after each
experiment.

Figure 6: Generating zero-window probes using TCP
echo service


4.1 Probing Procedure 


We use the following simple procedure to study zero- 
window probing in various TCP implementations. For 
each implementation, we conduct five experiments. 


1. From a host to be tested, open a TCP connection 
to the echo port [15] of another host. 


2. Keep sending data to the echo port without reading 
the echoed data. 
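
A probe program for this procedure might look like the sketch below. It is hypothetical, not the program used in the experiments: the function names, chunk size, and safety limit are assumptions. Port 7 is the standard TCP echo port. The socket is made non-blocking so that the program can tell when the local send buffer is full, which with a real echo server is the point at which the zero-window condition has set in.

```python
import socket

ECHO_PORT = 7  # standard TCP echo service


def connect_echo(host, port=ECHO_PORT):
    """Step 1: open a TCP connection to the echo port of another host."""
    return socket.create_connection((host, port))


def send_without_reading(sock, chunk=b"x" * 1024, limit=1 << 30):
    """Step 2: keep sending data and never read the echo. Returns the
    number of octets accepted before the local send buffer filled up."""
    sock.setblocking(False)
    sent = 0
    try:
        while sent < limit:
            sent += sock.send(chunk)
    except BlockingIOError:
        pass  # no more buffer space: nothing further can be sent for now
    finally:
        sock.setblocking(True)
    return sent
```

After `send_without_reading(connect_echo("B"))` returns, where "B" stands for the address of the echo host, the program only needs to keep the connection open while the monitor records the probes.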


As Figure 6 shows, because the probe program sends 
data without reading the echo, the receive buffer of 
TCP A eventually becomes full, causing it to send a 
zero-window ACK segment to TCP B. Because TCP 
B cannot send data to TCP A, the send buffer of TCP 
B will become full of echoed data. When the echo 
server on B cannot send more data, the receive buffer 
of TCP B will become full. Once the receive buffer 
of TCP B becomes full, it advertises a zero window 
to TCP A. After the zero-window condition exists for 
more than a threshold time period, both sides begin 
sending zero-window probes. 


4.2 Results 


As Table 2 and Figure 7 show, all the implementations
probed exponentially increase the time interval
Note: Only the time intervals of the first 11 zero-window probes are shown.


Figure 7: The intervals of successive zero-window probes in TCP implementations 















Operating     Data size in    Min. probe   Max. probe
system        probe           interval     interval
Solaris 2.1   1 MSS octets    200 ms       60 sec.
SunOS 4.1.1   1 octet         5 sec.       60 sec.
HP-UX 9.0     1 octet         4 sec.       60 sec.
IRIX 5.1.1    1 octet         5 sec.       60 sec.











Table 2: Zero-window probe in TCP implementations 


between probes and limit the probe interval to a max- 
imum value of 60 seconds. Most implementations 
impose a minimum probe interval between 4 and 5 
seconds; Solaris 2.1 uses the lower bound on RTO es- 
timates as the minimum probe interval, which is much 
smaller than other systems. 


Figure 7 shows another difference between the Solaris
implementation and the other systems: there are two
curves on the graph of Solaris. One curve corresponds 
to the results of two experiments (Experiment #2 and 
#3) and the other curve corresponds to three. A plausi- 
ble explanation of the difference is that Solaris uses a 
finer granularity timer than other systems. If the probe 
intervals shown represent an exponential increase, di- 
vergence in the two curves must result from a difference 
in the initial RTO values. We conclude that Solaris 2.1 
TCP had two RTO estimates during the experiments. 


4.3 Two Approaches In Handling Zero- 
window Probing 


From the data, we observe two approaches used to 
handle zero-window probing. Observe that a sender 
does not need to distinguish between a peer that has in- 
sufficient buffer space to receive a segment and a seg- 
ment that is lost. In both situations, the data segment is 
unable to reach the application. Although a receiving 
TCP will generate a zero-window ACK segment when 
it has no receive buffer space and will not generate an 
ACK for a lost data segment, the unreliable delivery of 
the zero-window ACK segment in TCP makes both sit- 
uations look similar to a sending TCP. The observation 
suggests that one can use a retransmitted data segment 
as a zero-window probe. 


Indeed, the first approach uses a retransmitted data 
segment as a zero-window probe. If a receiving TCP 
does not have sufficient buffer space to accept an in- 
coming data segment, it sends a zero-window ACK 
without acknowledging the data segment. After a pe- 
riod of one RTO, the sender retransmits the data seg- 
ment. The retransmitted data segment acts as a zero- 
window probe. Unlike when retransmitting missing data segments,
a sender keeps transmitting zero-window probes
even if a receiver does not ACK the probes.


Using a retransmitted data segment as a zero-
Segment  Host A                             Host B
Number   (SunOS 4.1.1)                      (Solaris 2.1)               Comment
                                                                        (Both sides have zero receive window)
1094                             <---  S:8552 D:512 A:1369 W:0          (zero-window probe)
1095     S:1369 D:0 A:8552 W:0   --->                                   (ACK with window = 0)
         (5 seconds later)
1096     S:1369 D:1 A:8552 W:0   --->                                   (zero-window probe)
1097                             <---  S:9064 D:512 A:1369 W:0          (ERROR! bad sequence #)
1098     S:1369 D:0 A:8552 W:0   --->                                   (ACK with the seq. # expected)
......

S: Sequence number, D: Number of data octets, A: Acknowledgment number, W: Window.
Note: Only the last four digits of the sequence number and acknowledgment number are shown.





Figure 8: Illustration of an implementation flaw in Solaris 2.1 TCP 


window probe is optimistic in the sense that it sends 
as much data as possible in a zero-window probe and 
expects the receiver’s receive window to open within 
one RTO period. The scheme responds quickly when 
an ACK that would reopen the window is lost. The 
scheme is also efficient because TCP implementations 
must implement the Silly Window Syndrome avoidance
algorithm¹¹ [1, 2]. It is likely that when the receiver
opens the receive window, it will open at least the 
size of a maximum segment (1 MSS). However, the 
scheme consumes more network resources than the 
second approach, described below, when the receiver’s 
zero-window persists. 


The second approach treats zero-window probing 
as a special case. When a sender receives a zero- 
window advertisement from the receiver, it enters a 
zero-window probing state and delays sending data for 
a predetermined interval t¹². If a window-opening
ACK segment arrives within interval t, TCP immediately
sends data without sending a zero-window probe.
However, the scheme suffers a (long) delay of t if an 
ACK segment to reopen the window is lost in transit. 
The zero-window probes in this approach carry only 
one octet of data; they are designed to elicit an ACK 
segment from the peer, not to transfer data. 


From the experiments, we conclude that Solaris uses 
the first approach, and the others use the second ap- 
proach. 
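
The resource trade-off between the two approaches can be illustrated with a rough calculation. The sketch below is not taken from any of the implementations probed: the function name `probe_octets`, the 40-octet combined TCP/IP header, and the fixed probe interval with no backoff are all assumptions made for illustration.

```python
def probe_octets(stall_seconds, interval_seconds, payload_octets, header_octets=40):
    """Octets put on the wire by zero-window probes during a stall.

    Assumes one probe every interval_seconds with no backoff, and
    header_octets of TCP/IP header per probe (both are simplifications).
    """
    probes = int(stall_seconds // interval_seconds)
    return probes * (payload_octets + header_octets)


# A 60-second zero-window stall, probed every 5 seconds in both cases.
full_segment = probe_octets(60, 5, payload_octets=512)  # first approach
one_octet = probe_octets(60, 5, payload_octets=1)       # second approach
print(full_segment, one_octet)
```

Under these assumptions the full-segment probes cost more than ten times as many octets as the one-octet probes, which is the sense in which the first approach is more expensive when the zero window persists.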


¹¹Silly Window Syndrome is characterized as a situation in which a
steady pattern of small TCP window increments results in small data
segments being sent. Sending small data segments lowers TCP
performance because TCP and IP headers consume network bandwidth.

¹²Experiments show that t is 4 or 5 seconds in the implementations
probed (see Table 2).


4.4 Implementation Flaw Found


The data from the zero-window probe experiments
show a protocol violation in SunOS 4.0.3 and an
implementation flaw in Solaris 2.1. SunOS 4.0.3 TCP
does not acknowledge zero-window probes at all.
Solaris 2.1 TCP responds incorrectly to a peer’s
zero-window probe when both sides have a zero receive
window; we describe the flaw below.


As Figure 8 illustrates, host A communicates with 
host B (running Solaris 2.1); both hosts have a zero 
receive window. In segment #1094, B sends a zero- 
window probe with sequence number 8552 and 512 
octets of data to A. A acknowledges it properly in seg- 
ment #1095. Five seconds later, in segment #1096, A 
sends a zero-window probe with one octet of data to 
B. Note that the ACK number in segment #1096 is the 
same as the ACK number in segment #1095, i.e., A did 
not acknowledge the 512 octets of data that B sent in 
segment #1094. However, B acknowledges the zero-window
probe with a segment (segment #1097) containing
an invalid sequence number, 9064 (8552+512),
as if the zero-window probe from A had acknowledged
the data that B sent in segment #1094. A responds to
the error by sending an ACK segment (segment #1098)
with the sequence number it expects. The flaw occurs
in all of the Solaris trace data we gathered.
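
The check that exposes this flaw can be sketched mechanically. The code below is an illustrative reconstruction, not the analysis tool used in the experiments: it replays the five segments of Figure 8 and flags any segment whose sequence number lies beyond what the peer has acknowledged plus the window the peer last advertised. The names `Segment` and `find_bad_sequence_numbers` are hypothetical, and sequence number wraparound is ignored for brevity.

```python
from collections import namedtuple

Segment = namedtuple("Segment", "number sender seq data ack window")

# Segments from Figure 8 (last four digits of sequence/ack numbers).
TRACE = [
    Segment(1094, "B", 8552, 512, 1369, 0),
    Segment(1095, "A", 1369, 0, 8552, 0),
    Segment(1096, "A", 1369, 1, 8552, 0),
    Segment(1097, "B", 9064, 512, 1369, 0),
    Segment(1098, "A", 1369, 0, 8552, 0),
]


def find_bad_sequence_numbers(trace):
    """Return the numbers of segments that start beyond the peer's
    last acknowledgment plus the peer's last advertised window."""
    last_ack = {}     # host -> latest acknowledgment number it has sent
    last_window = {}  # host -> latest window it has advertised
    bad = []
    for seg in trace:
        peer = "A" if seg.sender == "B" else "B"
        if peer in last_ack:
            limit = last_ack[peer] + last_window[peer]
            if seg.seq > limit:
                bad.append(seg.number)
        last_ack[seg.sender] = seg.ack
        last_window[seg.sender] = seg.window
    return bad


print(find_bad_sequence_numbers(TRACE))
```

Only segment #1097 is flagged: A had acknowledged up to 8552 with a zero window, yet B sent a segment starting at 9064.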


5 Conclusion and Future Work 


This paper introduces the active probing technique 
and demonstrates how it can be used to study TCP 
implementations. The technique treats a TCP imple- 
mentation as a black box and uses specially designed 
probe procedures to examine its behavior. A packet
trace taken during active probing can be used to
deduce design parameters and design decisions in TCP
implementations. The results show that active probing
is an effective tool.


Insight into black box behavior depends on probe 
procedure design and careful analysis of the resulting 
output. We demonstrated three probe procedures that 
examine three aspects of TCP. Additional probe proce- 
dures to study other aspects of TCP are also possible. 
For example, one can design a probe procedure that 
generates heavy network traffic through a gateway to 
examine how a TCP behaves in a congested environ- 
ment. 


Because active probing can be used to deduce design 
parameters and design decisions in TCP, the technique 
can also be applied to protocol conformance checking. 
One can design procedures that induce output from 
a TCP implementation, and use an automated tool to 
analyze the output and verify that it conforms to the 
protocol specification. For example, the failure to re- 
spond to the zero-window probes in SunOS 4.0.3, as 
discussed in section 4, can easily be detected by such 
a method. 
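
As a hedged illustration of such an automated check, the sketch below scans a simplified trace for zero-window probes that the peer never answers. The trace representation (tuples of sender, data octets, and advertised window), the two host labels, and the function name `unanswered_probes` are assumptions for this example; the sample traces are made up and are not measurements from the experiments.

```python
def unanswered_probes(trace):
    """trace: list of (sender, data_octets, window) tuples in time order.

    A segment counts as a zero-window probe when it carries data although
    the peer's most recent advertisement was a zero window. Returns the
    indexes of probes that are not immediately followed by a segment
    from the peer.
    """
    peer_window = {}  # host -> last window that host advertised
    missing = []
    for i, (sender, data, window) in enumerate(trace):
        peer = "B" if sender == "A" else "A"
        if data > 0 and peer_window.get(peer) == 0:
            following = trace[i + 1] if i + 1 < len(trace) else None
            if following is None or following[0] != peer:
                missing.append(i)
        peer_window[sender] = window
    return missing


# Hypothetical traces: B has closed its window and A probes it.
conforming = [("B", 0, 0), ("A", 1, 4096), ("B", 0, 0), ("A", 1, 4096), ("B", 0, 0)]
silent_peer = [("B", 0, 0), ("A", 1, 4096), ("A", 1, 4096)]
print(unanswered_probes(conforming), unanswered_probes(silent_peer))
```

A peer that behaves like the SunOS 4.0.3 case described in the text would produce a trace resembling `silent_peer`, where every probe is reported.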


The implementation flaws found also show that ac- 
tive probing can be used to test whether TCP imple- 
mentations operate correctly. From the point of view 
of software engineering, one can design probe pro- 
cedures to create conditions that occur frequently or 
infrequently, thus providing tests that cover cases not 
normally found through passive monitoring. 


Unusual output can be used to detect implementation 
flaws in TCP. For example, an implementation of TCP 
that generates excessive retransmissions in a LAN en- 
vironment may contain an implementation flaw. The 
implementation flaws in Solaris 2.1, as discussed in 
sections 2 and 4, were detected by observing excessive 
retransmissions in the trace output. It would be in- 
teresting to combine a knowledge-based trace analysis 
tool [5] with active probing to accurately detect other 
abnormal TCP behavior. 


Finally, most of the TCP implementations probed
in this paper are BSD-derived. It would be possible
to probe non-BSD-derived TCPs (e.g., Plan 9 TCP
[13, 17] and Xinu TCP [3]) to determine
the similarities and differences.


Discovering Network Time Protocol Servers

Abstract


The Network Time Protocol (NTP) is widely used to 
synchronize computer clocks throughout the Internet. 
Existing NTP clients and servers form a very large 
distributed system, yet the tools available to observe 
and manage this system are fairly primitive. This 
paper describes our experiences with a prototype tool 
that attempts to discover relevant information about 
every NTP site on the Internet. The data produced by 
this tool can be used for a variety of purposes, includ- 
ing locating nearby accurate time servers and com- 
puting aggregate and long-term evaluations of the 
size and health of the NTP system. We are building a 
client/server system around this tool, to allow new 
NTP server administrators to make informed choices 
among the possible servers with which to synchron- 
ize, balancing the need for accurate time with the 
need to distribute NTP server load. This is an impor- 
tant step towards improving global NTP system sca- 
lability, since at present our measurements indicate 
that the high-stratum servers are heavily overloaded. 


1. Introduction 


1.1. Clock Synchronization and NTP 


Clock synchronization is useful for a wide variety of 
purposes, particularly in a network environment. 
Uses include keeping accurate file timestamps in dis- 
tributed filesystems (e.g. so that make doesn’t get 
confused), ticket validation timers in security systems 
like Kerberos [SNS88], and potentially even
synchronizing clocks across the globe for
very-long-baseline radio astronomy work.


Designed and developed by David Mills of the
University of Delaware, the Network Time Protocol
(NTP) [MIL92] is currently used by thousands of
Internet hosts to synchronize their local clocks to
within a few milliseconds of the international time
standard. Systems to distribute accurate time have
been studied for some time [LIN80, BRA80], but
NTP’s contribution is that it is able to distribute
accurate time over the unpredictable Internet. Hosts
participating in time distribution via NTP form a
hierarchical, master-slave, self-organizing subnet¹ of the
Internet. At the top layer of the hierarchy are the
sources of very accurate time. These stratum-1
servers usually have local atomic clocks (that are
synchronized by means other than NTP) or radio
receivers that decode time signals broadcast by
national standards organizations (e.g., WWV
timecodes broadcast by NIST in Ft. Collins,
Colorado).


NTP goes to great lengths to distinguish time 
servers with accurate data from those with false data 
and to obtain the best synchronization possible in the 
face of widely varying Internet delays. Recent ver- 
sions of NTP include algorithms to combine the 
offsets of several clocks, to construct a synthetic time 
more accurate than any individual time server. 


In addition to its value for supporting clock 
synchronization, running an NTP server can 
indirectly help network managers uncover and diag- 
nose network problems, based on the statistics NTP 
maintains. For example, NTP keeps a status register 
of how "reachable" its servers have been recently. 
This information provides a crude but useful measure 
of packet loss, which can be helpful in diagnosing 
network load or connectivity instabilities. 


We chose to study NTP because it is an impor- 
tant and widely distributed system. At present, it is 
"bundled" with the software distributions from a 
number of workstation manufacturers, and a number 
of large organizations use NTP. The U.S. Weather
Service uses NTP, and soon every Public Broadcasting
Service station affiliate will be a client. Merrill
Lynch is currently populating its worldwide network
with NTP.


¹The word "subnet" is used in this paper to refer to the
subset of Internet hosts that use the NTP protocol. It should not be
confused with the more common usage, which denotes partitioning
a single IP network number into multiple smaller networks.


1.2. NTP Management Problems 


While NTP goes to great lengths to maintain well- 
synchronized clocks in the face of unpredictable 
Internet behavior, at present managing the NTP net- 
work itself is quite difficult. The latest NTP software 
distribution includes a few debugging tools for exa- 
mining and changing the state of an individual server, 
but does not include tools to discover the nearest 
NTP server or to help debug the NTP system as a 
whole. When only a few sites ran NTP, this informa- 
tion was easy to gather by manual methods. But now 
that there are many thousands of sites running some 
version of NTP, the need for additional discovery and 
management tools has become painfully clear. 


The largest problem is that the stratum-1 
servers are seriously overworked and in danger of 
becoming saturated. In principle this should not be a 
problem, because NTP allows time to be distributed 
hierarchically. If this hierarchical architecture were
used appropriately, the NTP protocol would theoretically
be capable of distributing accurate time to every
host on the Internet.²
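
The capacity claim can be checked with a short calculation. The sketch below uses the figures given in the accompanying footnote (10 stratum-1 servers, 10 clients per server) and assumes that the hierarchy can be filled down to stratum 15, since stratum 16 denotes a host that is infinitely far from the time source. The variable names are illustrative only.

```python
STRATUM_1_SERVERS = 10
CLIENTS_PER_SERVER = 10
MAX_STRATUM = 15  # assumption: stratum 16 means "not synchronized"

# Every host at one stratum serves CLIENTS_PER_SERVER hosts at the next.
hosts_at = {1: STRATUM_1_SERVERS}
for stratum in range(2, MAX_STRATUM + 1):
    hosts_at[stratum] = hosts_at[stratum - 1] * CLIENTS_PER_SERVER

total_hosts = sum(hosts_at.values())
ipv4_addresses = 2 ** 32

print(hosts_at[MAX_STRATUM], total_hosts, total_hosts > ipv4_addresses)
```

Even with this very light fan-out, the bottom stratum alone holds far more hosts than there are 32-bit IP addresses.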


The problem arises because it is not trivial for 
an NTP server administrator to pick an appropriate 
set of servers with which to "chime" (i.e., synchronize
clocks) when first joining the NTP server network.
As a result, far too many administrators select
high-level NTP servers.³ An NTP discovery system
that allowed new administrators to identify "nearby" 
servers would reduce this problem, and markedly 
enhance global system scalability. Running periodic 
surveys would also make it possible for regional net- 
work administrators to monitor the configuration of 
the NTP subnet within their domain and apply social 
pressure to fix poorly configured NTP hosts. 


An alternative approach would be to restrict
which systems could chime with high-stratum NTP
servers, through an access control mechanism. While
the code for this approach has been implemented, it
has not been widely adopted. In our opinion this
approach should be avoided: it would result in more
work for the administrators of the restrictive servers,
while merely increasing the load on the remaining
unrestricted servers. Because the problem arises
from the inability to obtain good information, a
solution based on discovery seems more appropriate than
one based on access control.


²For example, assuming only 10 stratum-1 servers and a
very light load of 10 clients per server, the NTP subnet could
contain a maximum of 10^15 hosts. This is much larger than the
current IP address space.

³See the following section on Survey Results, and Figure 2
for details.


In addition to supporting more well-informed 
configuration management choices, an NTP 
discovery/survey tool is useful for helping to debug 
and understand the NTP protocol itself. To our 
knowledge there has never been an aggregate picture 
of the "state of the NTP network" at a finer level of 
detail than a simple count of the number of hosts run- 
ning NTP. We hope that by offering the ability to 
collect meaningful measurements of the state of the 
NTP world, a deeper understanding of the workings 
of this large, distributed system can be attained, 
which would enable further improvements to the 
NTP model. 


2. Survey Methodology 


There is no complete registry of hosts that run NTP. 
Instead, we begin with a list of hosts known to run 
NTP, query each host with an NTP information 
request, parse the response, and add any previously 
unseen hosts to the database. These newly 
discovered hosts are then queried and the process 
repeats until no new hosts are discovered. 
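
The iterative process described above amounts to a breadth-first traversal. The following is a minimal sketch, not the prototype's code (which is written in C and Perl): `survey` is a hypothetical name, and `query` stands in for whatever function sends an NTP information request and returns the hosts named in the reply, or `None` when the host does not answer.

```python
from collections import deque


def survey(seed_hosts, query):
    """Discover hosts by repeatedly querying newly seen hosts.

    query(host) must return an iterable of hosts mentioned in that
    host's reply, or None if the host could not be queried.
    Returns (all hosts seen, hosts that answered).
    """
    pending = deque(dict.fromkeys(seed_hosts))  # de-duplicated, in order
    known = set(pending)
    responded = set()
    while pending:
        host = pending.popleft()
        reply = query(host)
        if reply is None:
            continue
        responded.add(host)
        for other in reply:
            if other not in known:
                known.add(other)
                pending.append(other)
    return known, responded


# A made-up four-host network; "d" is known only by reference.
network = {"a": ["b", "c"], "b": ["a", "d"], "c": []}
known, responded = survey(["a"], network.get)
print(sorted(known), sorted(responded))
```

The gap between the two returned sets mirrors the distinction drawn later in the paper between hosts suspected of running NTP and hosts that actually answered queries.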


At first examination, one might believe that 
this iterative survey process would quickly discover 
all Internet NTP hosts. For a variety of reasons, this 
is not so. In the following sections we discuss the 
problems and the prototype’s approach to solving 
them. 


2.1. Monitor List Queries 


The biggest difficulty in implementing an NTP sur- 
vey is that the protocol does not require each NTP 
time server to keep track of its clients. While all
NTP implementations require clients to track the
servers with which they chime, servers may keep little
or no information about the clients that chime with them.
This makes it easy to move up the NTP hierarchy, 
but often impossible to move down. Given that we 
know the most about the top two levels of the tree 
and relatively little about the leaves, this is a serious 
restriction. 


While keeping track of clients is not required 
by the NTP specification, it is frequently very useful 
for debugging. Client tracking is supported by the 
XNTP implementation of NTP Version 3 [XNTP93]. 
This debugging feature of XNTP is called "monitor 
mode", and the results of monitoring can be retrieved 
remotely with the "monitor list" command. 


2.2. System and Peer Variable List Queries 


Versions 2 and 3 of the NTP protocol specify a stan- 
dard control packet format that allows internal NTP 
variables to be examined and set. These variables 
contain useful information about the state of the local 
NTP system, the software phase-locked-loops and 
filters that it uses, the status of the peers with which a 
host is chiming, and a wealth of other information. 
These queries provide a good source of information 
for an NTP survey. 


An immediate problem with an NTP survey
tool is choosing which variables should be requested.
The current solution to this problem is to allow the 
user to specify the set of interesting variables. A 
configuration file lists all of the system and peer vari- 
ables to be used in queries, and any returned values 
for these variables are recorded by the survey. 


One problem with variable queries is that dif- 
ferent implementations of NTP can have slightly dif- 
ferent spellings of some of the variable names, result- 
ing in query failures when a server is asked for non- 
existent variables. What makes this minor problem 
much worse is the fact that, while a variable query 
command can contain a long list of variable names
whose values are to be returned, all observed NTP 
implementations simply return an error if any of the 
variable names are unknown to that implementation. 
This leads to the need for a rather complex 
query/error/retry algorithm to extract data from each 
NTP server. 
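
The paper does not spell out the query/error/retry algorithm, so the following is only one plausible sketch of it: when a request for several variables fails, split the list in half and retry each half, until the unknown names have been isolated. The function name `fetch_variables`, the use of `LookupError` to model the server's all-or-nothing error, and the stand-in server are assumptions for illustration.

```python
def fetch_variables(names, query):
    """Collect as many variable values as the server will give.

    query(names) returns a mapping of values, or raises LookupError if
    ANY requested name is unknown (the all-or-nothing behavior
    described in the text).
    """
    if not names:
        return {}
    try:
        return dict(query(names))
    except LookupError:
        if len(names) == 1:
            return {}  # this single name is unknown to the server
        mid = len(names) // 2
        values = fetch_variables(names[:mid], query)
        values.update(fetch_variables(names[mid:], query))
        return values


# Stand-in server that knows only two variables (values are made up).
SUPPORTED = {"stratum": 2, "rootdelay": 15.0}


def fake_query(names):
    if any(name not in SUPPORTED for name in names):
        raise LookupError("unknown variable")
    return {name: SUPPORTED[name] for name in names}


print(fetch_variables(["stratum", "rootdispersion", "rootdelay", "bogus"], fake_query))
```

Bisection keeps the number of retries small when only a few names are misspelled, at the cost of extra round trips compared with a server that would simply skip unknown names.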


A smaller problem with variable queries is
that query responses are returned formatted for
human consumption, including white space,
punctuation, and newlines. While this format makes
it easy to write simple query/display programs, it
made implementation somewhat more
difficult for our NTP survey tool.
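
A parser for such responses might look like the sketch below. It assumes a comma-separated `name=value` layout with optional quoting and line breaks; the exact layout varies by implementation, and the sample response text is invented for the example. The function name `parse_variables` is hypothetical.

```python
import re

# name = value, where value is either a quoted string (which may contain
# commas) or everything up to the next comma or line break.
_PAIR = re.compile(r'([A-Za-z_][\w.]*)\s*=\s*("[^"]*"|[^,\r\n]*)')


def parse_variables(text):
    """Turn a human-formatted variable listing into a dict of strings."""
    values = {}
    for name, raw in _PAIR.findall(text):
        values[name] = raw.strip().strip('"')
    return values


sample = 'stratum=2, rootdelay=15.32,\r\nrootdispersion=40.11, system="UNIX, generic"'
print(parse_variables(sample))
```

Values are kept as strings here; converting them to numbers is left to the caller, since the appropriate type depends on the variable.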


2.3. Version Issues 


The NTP protocol is an evolving entity, and various 
implementors have made substantial improvements 
with four major versions of the protocol over the past 
eight years [MIL85, MIL88, MIL89, MIL92]. But as 
with any widely deployed system, there have been a 
few compromises made to facilitate backwards com- 
patibility. 


The basic idea is that newer versions of NTP 
may interoperate with systems running older ver- 
sions, but when they do so it is suggested that they 
"fib" about their own version number so as not to 
confuse the remote system. This is wonderful from a 
compatibility point-of-view, but makes it difficult for 
our survey tool to determine correct version numbers. 


2.4. Access Restrictions 


A potential problem that we ignored in our current 
prototype is the issue of authentication. Given that a 
malicious user could try to skew clocks by supplying
faulty NTP times, the NTP protocol specification
includes support to authenticate requests. Unfor- 
tunately, an NTP server that uses authentication is 
not queryable by our discovery tool. Even though the 
authentication code is widely deployed, the need for 
it has not become severe, and at least for the moment 
an NTP explorer can blissfully explore the network 
without worrying about lack of permissions and 
magic keys. 


One reason NTP-based access controls are not 
often used is that sites often use firewall gateways 
[CQ92] to control all incoming traffic, rather than 
setting up restrictions on a per-service basis. These 
gateways present difficulties for our survey tool 
because there is no easy method of determining 
whether a potential NTP host is behind a firewall. 
From the survey tool’s perspective the host appears 
to be "temporarily" unreachable. The only solution 
we can see for this problem is to add survey functions 
to the firewalls so that (approved) statistics of what’s 
going on behind the firewall can be exported to sur- 
vey tools like this one. 


One final access restriction note concerns a les- 
son learned by other network information discovery 
projects: Some Internet system administrators con- 
sider network discovery methods to be distressingly 
equivalent to trespassing. Fortunately, this is such a 
small percentage of the Internet community that their 
hosts can safely be omitted from the survey without 
seriously affecting the survey results. The prototype 
NTP explorer module has a simple method of host 
and network avoidance. We initialized its "don’t 
trespass" database from a list of systems whose 
administrators had previously requested to be left 
alone. 
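
A host and network avoidance check of this kind can be sketched in a few lines. This is an illustration rather than the prototype's implementation; the class name `AvoidList` is hypothetical, and the addresses are reserved documentation addresses, not real sites.

```python
import ipaddress


class AvoidList:
    """Hosts and networks that asked not to be surveyed."""

    def __init__(self, entries):
        # A bare address becomes a single-host network (/32 for IPv4).
        self.networks = [ipaddress.ip_network(e, strict=False) for e in entries]

    def blocks(self, address):
        """True if the address must not be queried."""
        addr = ipaddress.ip_address(address)
        return any(addr in net for net in self.networks)


avoid = AvoidList(["192.0.2.7", "198.51.100.0/24"])
for candidate in ("192.0.2.7", "198.51.100.200", "203.0.113.5"):
    print(candidate, "skip" if avoid.blocks(candidate) else "query")
```

Checking the list before every query, rather than only when a host is first added, keeps the survey polite even if the avoidance list grows while a run is in progress.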


3. Implementation 


The current implementation consists of two query 
programs written in C, and a small collection of utilities
and filters written in C and Perl. The two
query modules both take a list of IP addresses as 
input, do either a monitor-list or variable query, and 
update the databases with their results. 


There are two very different types of data that 
need to be collected by the NTP survey tool. One 
type of data is the NTP topology and the set of hosts 
that participate in the NTP subnet. These data are 
relatively small and easy to manage. 


The other type of data that needs to be collected
by an NTP survey is a large amount of NTP state
information giving the status of the NTP protocol
engine at each reachable NTP host in the Internet.
These data need to be recorded and made available to
analysis tools to understand how the NTP network is
performing and changing over time.


The current implementation splits these two 
types of data into different databases. One is a stan- 
dard UNIX "dbm" file that contains minimal informa- 
tion about each host that is an actual or suspected 
participant in the NTP subnet, along with the times- 
tamps and status codes of recent NTP query attempts. 
The other is a relational database recording all the 
detailed NTP data that are gathered by a particular 
survey run. These latter data are stored in an RDB
[HOB93] database, which supports a simple rela- 
tional model that eases manipulation and analysis. 
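
The host database can be pictured with a small sketch. The record layout below (a space-separated list of `timestamp:status` entries, capped at five) is invented for illustration and is not the prototype's actual format; Python's `dbm.dumb` module is used only as a convenient stand-in for a UNIX "dbm" file, and `record_attempt` is a hypothetical name.

```python
import dbm.dumb
import os
import tempfile
import time


def record_attempt(db, address, status, when=None):
    """Append one query attempt to a host's record, keeping the 5 latest."""
    when = int(time.time() if when is None else when)
    key = address.encode()
    history = db[key].decode().split() if key in db else []
    history.append(f"{when}:{status}")
    db[key] = " ".join(history[-5:]).encode()


with tempfile.TemporaryDirectory() as tmp:
    db = dbm.dumb.open(os.path.join(tmp, "ntphosts"), "c")
    record_attempt(db, "192.0.2.7", "ok", when=757382400)
    record_attempt(db, "192.0.2.7", "timeout", when=757468800)
    print(db[b"192.0.2.7"].decode())
    db.close()
```

Keeping only a short history per host keeps this database small, in line with the split described above between minimal per-host records and the detailed relational data.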


We seed the survey database from a list of
publicly available stratum-1 and stratum-2 time
servers manually maintained by Mills.⁴ At present
this file lists approximately 35 stratum-1 servers and 
70 stratum-2 servers. Since this file is not designed 
to be parsed electronically, we manually extracted 
time server host names from this file, and used them 
to seed the survey database. This initialization pro- 
gram is the only module in the prototype that allows 
the use of full domain-style hostnames. The rest of 
the implementation simply uses IP addresses in the 
interest of performance. Additional hosts to check 
can be added to the database by hand or by other 
means (e.g. a Fremont explorer module that has rea- 
son to believe that a host may be chiming NTP; see 
the Related Work section for a discussion of 
Fremont). 


4. Survey Experiences and Results 


The monitor list query uses a packet format that the 
NTP Version 3 standard defines only as "reserved for 
private use." Fortunately, the monitor list command 
is often available, because it is included in the widely 
deployed XNTP implementation. However, not all 
sites that run NTP use XNTP, and many that do run 
XNTP leave monitor mode disabled (it is disabled by 
default). Moreover, as mentioned earlier, some of 
those that collect monitor data require prior authori- 
zation to use the "monitor list" query and retrieve the 
information. 


Even with this daunting list of restrictions, it 
turns out that there are enough publicly retrievable
monitor listings that using this style of query resulted 
in gathering evidence of approximately 10,000 possi- 
ble NTP hosts in about 8 hours of survey time. 


⁴Available by anonymous FTP from louie.udel.edu,
/pub/ntp/doc/clock.txt


By comparison, the survey module that queries 
for system and peer variables plods along gathering a 
great deal of information, but adds comparatively 
few hosts to the database of potential NTP systems. 
It took about 50 hours to attempt to query 10,000
hosts.⁵
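
Fifty hours for 10,000 hosts works out to about 18 seconds per host on average, which is consistent with a survey that waits for each host in turn. One possible remedy, shown below purely as a sketch and not as the prototype's design, is to keep several queries in flight at once so that a slow or dead host does not hold up the rest. The function name `query_all` and the worker count are assumptions; `query` stands in for the real per-host query routine.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(hosts, query, workers=32):
    """Run query(host) for every host with up to `workers` in flight.

    Returns a dict mapping each host to the result of its query.
    """
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hosts, pool.map(query, hosts)))


# Stand-in query function; a real one would send an NTP request
# and return the parsed reply (or None on timeout).
results = query_all(["h1", "h2", "h3"], lambda host: host.upper(), workers=2)
print(results)
```

With overlapping queries the total survey time is bounded by the timeout multiplied by the number of hosts divided by the number of workers, rather than by the full number of hosts.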


The number of NTP hosts found in the initial 
survey was relatively large. The main database con- 
tains over 15,000 unique IP addresses of hosts that 
we have reason to believe speak NTP, and the survey 
was able to speak NTP directly with over 7,200 sys- 
tems. While this database contains no duplicate IP 
addresses, hosts with multiple network interfaces 
may be counted twice. 


In comparison with other NTP surveys, our 
survey has done rather well. Mills’ survey of July 
1993 [MIL93a] found a total of 6,185 hosts (via mon- 
itor list), while a survey done by Pruy in October 
1993 [PRUY93] was able to communicate with about 
2,100 hosts. 


The results of an NTP survey are a little tricky 
to summarize without being misleading. The set of 
NTP hosts in the database built by recursively run- 
ning the "monitor list" command is much larger than 
the set of NTP hosts that are actually reachable by 
NTP information queries. Therefore, we distinguish 
the results below by data source. 


4.1. Statistics from "Monitor List" Queries


The monitor facility of XNTP records information 
about a particular server’s clients. The IP addresses 
and NTP protocol version number of its clients are 
recorded, but very little else is recorded. As 
described before, the version number can be ficti- 
tious, and must be taken with a dose of skepticism. 


The number of IP hosts, by NTP version, derived from
the 1,760 hosts that responded to monitor list queries is
listed in Table 1. It is interesting that, although the
majority of sites are running the most recent version
of NTP, many sites still have not upgraded. Part of
the problem is that a number of workstation manufacturers
are bundling outdated versions of NTP
[MIL93b]. It would be interesting to collect these
measurements periodically as the NTP subnet continues
to grow, and see what percentage of "old" hosts
upgrade vs. how many new hosts start by running the
latest version.


Version 3        7,615
Version 2        2,432
Version 1        2,095
Version 0           58
Total Hosts     12,200

Table 1: NTP Host Count


⁵The long survey times are a result of the current
implementation’s use of sequential reads and timeouts.


4.2. Statistics from "Variable List" Queries


The data from the peer and system variable queries 
are more accurate than those from monitor list 
queries, though not nearly as complete. Even so, an 
amazing amount of raw data from our survey is available
for analysis. The following statistics are but a
first pass at mining interesting information from it.


While there were over 15,000 hosts in the com- 
pleted survey’s database of suspected NTP hosts, 
only 7,251 responded to NTP variable list queries. 
The statistics in Figure 1 are summarized from the 
data returned by these 7,251 hosts. Clearly, the 
high-level strata are overused. At present, Mills 
attempts to reduce this problem periodically by send- 
ing a message on the NTP mailing list asking people 
to back off of the stratum-1 servers and to make more 
use of the stratum-2 servers. Perhaps if our NTP 
discovery tool were built into the system (so that new 
site administrators could choose a good peer with 
which to chime), this would be less of a problem.


Figure 1: NTP Hosts per Stratum





Note that stratum 16 is defined by the NTP pro- 
tocol specification as infinitely far away from the 
time source. The hosts that claim to be at strata 13 
through 15 have more subtle problems. At first 
examination they appear to be a collection of isolated 
hosts in Germany with their own time source that 
believes itself to be at stratum-13, plus an extremely 
confused set of hosts at the University of Tennessee. 


The average number of clients per server can
be computed directly from the above figure, and is
shown in Figure 2. This figure clearly indicates how
poorly the stratum-2 (and lower) servers are utilized.


Figure 2: NTP Tree Branching


Central to the NTP clock calculations are the
delay and dispersion values between a client and its 
server. Delay is the round-trip network delay 
between the peer and its server. Dispersion is the 
computed maximum error of the peer clock relative 
to its server. Both quantities are typically measured 
in milliseconds. Table 2 shows some statistics 
regarding the aggregate delay and dispersion between 
clients and their servers, grouped by the stratum level 
of the servers. 


Rootdelay is the estimated total delay to the top 
of the NTP tree, while rootdispersion provides an 
error bound for how far off the local clock is from the 
NTP stratum-1 time source. Rootdispersion is prob- 
ably as good a metric to estimate the health of the 
NTP subnet as any. Table 3 summarizes the rootde- 
lays and rootdispersions from clients at each stratum 
to their NTP time source. 


Tables 2 and 3 show that there is a fairly small 
median error as the time is distributed from stratum 
to stratum, with the average being substantially 
higher. In other words, many low-stratum servers 
offer very good time, but some offer very bad time. 


Without advance knowledge of "how good" a
set of time servers might be, people tend to pick 
high-stratum servers, because servers at the top of the 
distribution tree would intuitively seem to provide 
more accurate times. Yet, as Tables 2 and 3 show, 
this is not necessarily the case. In fact, because of 
the new NTP clock-synthesis code, it is actually pos- 
sible that a low-stratum clock may be more accurate 
than any of its parents in the distribution tree. Tools 
are needed so that new NTP server administrators 
can make informed choices among the possible 
servers with which to chime, so that they can pick a
server that provides accurate time without overloading
the high-level servers.


5. Observations 


When we examined the results of our survey, a number
of unexpected results emerged. The most surprising
are enumerated below:


• A large number of hosts run NTP version 1.
  Out of the 15,000 hosts in the database, the
  NTP survey tools were able to speak NTP with
  about 7,200 hosts. About 7,700 of the remaining
  hosts did not respond to either Version 2 or
  Version 3 NTP packets. A small test program
  revealed that 5,600 of these hosts were
  reachable⁶, and of these 2,350 hosts would respond
  to an NTP version 1 date query.


• Many hosts that appear in monitor lists do not
  run NTP servers. The numbers mentioned
  above imply that about 3,300 hosts that are
  reachable and are suspected of running NTP (i.e.,
  appear in the database for one reason or
  another) do not run a full NTP peer. The most
  likely conclusion is that these hosts use NTP to
  set their clocks when booting, but do not run
  the normal NTP server process.


• After a query was sent to an NTP host, the
  response would often come back from an IP
  address different from the one expected. The
  NTP code currently treats this as an error, but it
  should instead be treated as a serendipitous
  discovery of multiple network interfaces on a
  gateway.


• While the initial implementation had a simple
  filter to delete the "loopback" host and other
  obviously invalid IP addresses before adding
  them to the database, about 85 of the 15,000
  "hosts" still turned out to be network numbers.
  It would be useful to track down how these
  network numbers were introduced.
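
The arithmetic behind the first two observations can be laid out explicitly. The sketch below uses only the rounded counts quoted in the list above; the variable names are illustrative.

```python
in_database = 15_000        # suspected NTP hosts
spoke_ntp = 7_200           # answered Version 2 or Version 3 queries
no_v2_v3_response = 7_700   # did not answer Version 2 or 3 packets
reachable = 5_600           # of those, answered the UDP echo test
answered_v1 = 2_350         # of those, answered a version 1 date query

# Reachable hosts that speak no version of NTP to the survey tools.
reachable_without_ntp = reachable - answered_v1
# Hosts that could not be reached at all by the test program.
unreachable = no_v2_v3_response - reachable

print(reachable_without_ntp, unreachable)
```

The first result, 3,250, is the "about 3,300" hosts cited in the second observation; the rounded inputs account for the small difference.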


5.1. The Need for Stability Measures 


One final observation is that the current NTP tools
provide a way to observe the time errors at the time
of the query. However, because NTP periodically
resynchronizes clocks, these measures may not capture
instabilities. For example, there was a problem
with the latest experimental version of NTP when run
on HP workstations that occasionally caused the
clock to be incorrect by about 40 years. If a downstream
client synchronized with such a clock after
this error arose, it could lead to many problems. A
global NTP survey/management system should provide
measures of clock error that would allow potential
clients to avoid such instabilities.


Clients to  Delay    Delay   Delay      Dispersion  Dispersion  Dispersion
Stratum     Average  Median  Std. Dev.  Average     Median      Std. Dev.
1           105      79      111        38          20          87
2           42       30      74         46          18          164
3           36       39      62         33          21          125
4           42       43      19         114         7           153
5           50       33      19         112         91          83

Table 2: Client-to-Server Metrics [msec.]


            Root     Root    Root       Root        Root        Root
Clients at  Delay    Delay   Delay      Dispersion  Dispersion  Dispersion
Stratum     Average  Median  Std. Dev.  Average     Median      Std. Dev.
2           155      95      177        102         52          213
3           160      104     166        175         114         251
4           184      105     163        195         145         259
5           116      28      182        362         166         419

Table 3: Client-to-Root Metrics [msec.]


⁶The program simply tested whether it could get a response from the host’s UDP echo port.


5.2. What was Missed? 


What percentage of the total do these numbers 
reflect? It’s impossible to know for sure, since there 
are many firewalls on the Internet hiding an unknown 
number of NTP hosts. Dave Mills has estimated that 
there are approximately 100,000 hosts running NTP, 
tens of thousands of which are behind firewalls 
[MIL93b]. The only way to know for sure is to get
the cooperation of the firewall sites to allow surveys
into their domains. In the meantime, modifying
XNTP to enable the monitor-list feature by default 
would quickly extend the reach of this survey tool. 


6. Related Work 


There are several systems related to various aspects
of the current work. Census is a tool that recursively
descends the Domain Name System (DNS) tree,
gathering information from as many sites as possible 
[GAN92]. While both Census and our survey tool 
attempt to discover a distributed collection of servers, 
the task is more difficult for NTP servers because not 
all servers track the clients they support. DOC is a 
tool that tests remote DNS servers for various 
configuration errors [HOT90]. While our survey tool 
focuses on performance problems caused by distribu- 
tion tree imbalance, DOC uncovers incorrect DNS 
server configuration information. 


Archie gathers directory listing information 
from "anonymous FTP" servers around the Internet, 
for use as an indexing/search service [ED92]. Archie 
is intended primarily as a location service, not for 
detecting problems. 


The Simple Network Management Protocol 
(SNMP) defines a general method to query and con- 
trol network servers [SNMP90]. If the NTP system 
were being written today, it would probably use 
SNMP instead of its native query/control protocol. 
Perhaps in time NTP will be changed to use SNMP, 
at which point the NTP survey code can begin to use 
more standard network management tools. 


The Fremont system gathers and cross- 
correlates data from a number of network protocols 
and information sources to construct a picture of key 
network characteristics, such as hosts, gateways, and 
topology [WCS93]. Each source is collected by a 
distinct "explorer module", which deposits the gath- 
ered data in a network accessible database managed 
by a journal server. The invocation of explorer 
modules is under the control of a discovery manager, 
which decides what information needs to be collected
and which explorer modules should be invoked to
collect the data.


Fremont’s support for multiple explorer
modules, a discovery manager, and a journal server
makes the opportunity for synergy obvious: the data
mined by the current NTP survey tool should
eventually be turned into data for the Fremont
system.


7. Conclusions 


In this paper we presented a survey tool for discover- 
ing the topology of the global NTP server network, 
and a set of experimental results from running this 
tool on the Internet. The basic survey methodology 
involves iteratively expanding a seed list of NTP 
hosts, based on information retrieved by NTP 
queries. This approach is complicated by a number 
of operational issues in the NTP network, ranging 
from version mismatches to firewall gateways and 
other limitations. 
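
The iterative expansion described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions, not the survey tool's actual code: `query_peers` is a hypothetical callable standing in for the NTP queries, and the toy topology replaces real replies.

```python
from collections import deque


def expand_seed_list(seeds, query_peers):
    """Breadth-first expansion of a seed list of NTP hosts.

    query_peers(host) is a hypothetical callable that returns the hosts
    named in that host's replies, or raises OSError when the host cannot
    be queried (firewall, version mismatch, timeout).
    """
    known = set(seeds)
    pending = deque(seeds)
    unreachable = set()
    while pending:
        host = pending.popleft()
        try:
            neighbours = query_peers(host)
        except OSError:
            unreachable.add(host)      # remember the host, but stop expanding here
            continue
        for other in neighbours:
            if other not in known:     # each host is queued at most once
                known.add(other)
                pending.append(other)
    return known, unreachable


# Toy topology standing in for real NTP replies (host "e" never answers).
topology = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["e"]}


def fake_query(host):
    if host not in topology:
        raise OSError("no reply")
    return topology[host]


known, unreachable = expand_seed_list(["a"], fake_query)
print(sorted(known), sorted(unreachable))
```

Hosts that cannot be queried still count as discovered, which mirrors the way firewalls limit how far such a survey can reach without hiding the hosts that other servers report.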


Our survey provides measurements of the 
current size and configuration of the NTP server net- 
work, uncovering approximately 10% of the total 
number of estimated hosts running NTP. The survey 
was limited primarily by the presence of firewall 
gateways in the Internet. The results showed that the 
primary time servers are overloaded, and that the glo- 
bal NTP time distribution tree is very poorly bal- 
anced. 


To allow continued growth of the NTP com- 
munity, a more balanced time distribution tree is 
necessary. Our discovery tool is a first step in 
developing the tools needed to help extend the scalability
of the NTP system. In particular, Tables 2 and 3
demonstrate that low-stratum servers often provide
accurate times. Our tool provides a means by which new
NTP server administrators can make informed choices
among the possible servers with which to chime.


While the total extent of the NTP subnet prob- 
ably won’t be known until more hosts enable the 
"monitor list" command, the usefulness of this tool 
may encourage more people to do so. In the meantime,
the current sample seems large enough to yield
information that is interesting, and hopefully useful,
to a wide variety of users.


One might observe that a discovery tool would 
not be necessary if NTP itself kept records of the 
server network topology (e.g., maintaining both 
upwards and downwards pointers, as is done in the 
DNS tree). There are at least two general reasons 
NTP does not provide adequate support in this 
regard. First, when developing any complex 
algorithm, it behooves software designers to focus on
the problem at hand. In retrospect it is usually easy 
to point out what "should" have been planned into the 
system from the start, yet it makes sense to postpone 
worrying about issues that will only arise once a sys- 
tem becomes "wildly successful". Second, it often is 
not clear from the start what state information should 
be collected to aid in system management. 


From this perspective, our discovery and sur- 
vey tool provides both a "transition path" towards a 
future management-oriented version of NTP, and a 
set of experiences concerning what is needed. 


8. Future Work 


8.1. Client/Server Architecture for Exporting 
Data 


At present we are developing an architecture within 
which to deploy our survey tool. In this system, a set 
of servers will explore and measure the NTP network 
from various parts of the Internet, and allow survey 
clients to pose queries concerning which NTP server 
they should select for clock synchronization. The 
architecture will need to provide topology-dependent 
answers, so that NTP clients are paired with nearby 
NTP servers. We will also provide an interface that 
accepts a set of network numbers (e.g., all the net- 
works in a regional network), and displays NTP 
configuration data about the hosts within that region. 
With this tool one could quickly check if too many 
hosts are chiming outside their local region. 
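
A minimal sketch of the regional check described above, assuming the region is given as CIDR prefixes (the original text speaks of network numbers) and that client/server associations are already available as address pairs. The function name and the sample addresses, taken from documentation ranges, are illustrative only.

```python
import ipaddress


def chiming_outside_region(region_networks, associations):
    """Return (client, server) pairs whose client is inside the region
    while the server it chimes with is not.

    region_networks: iterable of CIDR strings (assumed input format)
    associations:    iterable of (client_ip, server_ip) string pairs
    """
    nets = [ipaddress.ip_network(n) for n in region_networks]

    def in_region(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in nets)

    return [(c, s) for c, s in associations
            if in_region(c) and not in_region(s)]


region = ["192.0.2.0/24", "198.51.100.0/24"]
pairs = [
    ("192.0.2.10", "192.0.2.1"),       # local client, local server
    ("192.0.2.11", "203.0.113.5"),     # local client, remote server
    ("198.51.100.7", "192.0.2.1"),     # both inside the region
    ("203.0.113.9", "203.0.113.5"),    # client is not in the region at all
]
outside = chiming_outside_region(region, pairs)
print(outside)
```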


We will make the above software available by 
anonymous FTP from ftp.cs.colorado.edu in
/pub/cs/distribs/chimesurvey, around the end of Summer
1994.


8.2. Performance Improvements 


Even though the number of packets sent to do the
NTP queries is fairly small,⁷ our tool currently takes
a long time to complete a survey. The tool could be 
sped up significantly by using a simple "pipelined" 
design that allowed for overlapping I/O with 
timeouts. The major complications are parallel 
updates to the databases, and the fact that the NTP 
protocol is based on UDP (and hence the operating 
system interface is full of dangers of dropping packets).
Neither of these is a serious obstacle, and a
fully parallel version of this explorer should be 
implemented. 
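
One way such overlapping of queries and timeouts might look is sketched below with Python's asyncio. It is an assumption-laden illustration, not the planned implementation: `probe` stands in for a UDP NTP query, and all results are gathered within a single event loop so that the database update stays serialized.

```python
import asyncio


async def survey(hosts, probe, timeout=1.0, max_in_flight=32):
    """Query many hosts concurrently; a silent host costs at most `timeout`."""
    limit = asyncio.Semaphore(max_in_flight)   # bound the outstanding queries
    results = {}

    async def one(host):
        async with limit:
            try:
                results[host] = await asyncio.wait_for(probe(host), timeout)
            except asyncio.TimeoutError:
                results[host] = None           # treated as "no answer"

    await asyncio.gather(*(one(h) for h in hosts))
    return results


async def fake_probe(host):
    # Stand-in for a real query; the host named "slow" never answers in time.
    await asyncio.sleep(5.0 if host == "slow" else 0.01)
    return "stratum-2"


demo = asyncio.run(survey(["a", "b", "slow"], fake_probe, timeout=0.2))
print(demo)
```

With this structure the total survey time is governed by the slowest timeout per batch rather than by the sum of all timeouts.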


⁷About 5-10 packets per host.


8.3. Additional Sources of Potential NTP 
Hosts 


Our survey’s success depends critically on
which hosts were initially seeded in the survey database.
To improve the survey’s thoroughness, there
should be other methods of seeding this database.
Some possible methods include Ethernet monitoring
logs; queries of selected hosts found in the DNS;⁸
hostnames from messages posted to the netnews 
group that discusses NTP; hosts discovered from 
DNS traversals that contain "ntp" in their name; and 
maybe even checking hosts that have retrieved the 
NTP source using FTP from louie.udel.edu. 


8.4. Integration with Fremont 


Our NTP survey tool should be integrated with the 
Fremont discovery system [WCS93]. The NTP 
configuration program(s) could make use of the 
topology of the network that is recorded in the 
current Fremont database, and the NTP survey occa- 
sionally discovers hosts with multiple network inter- 
faces that would be of interest to Fremont. 
Abstract


Run-time resolution of library functions pro- 
vides a rich and powerful opportunity to collect work- 
load profiles and function/parameter trace information 
without source, special compilation, or special linking. 
This can be accomplished by having the linker resolve 
library functions to special wrapper functions that col- 
lect statistics before and after calling the real library 
function, leaving both the application and real library 
unaltered. The set of dynamic libraries is quite large 
including interesting libraries like libc (the C library 
and Operating System interface), graphics, database, 
network interface, and many more. Coupling this with 
the ability to simultaneously trace multiple processes 
on multiple processors covering both client and server 
processes yields tremendous feedback. We have found 
the amount of detailed information that can be gath- 
ered has been useful in many stages of the project life- 
cycle including the design, development, tuning, and 
sustaining of hardware, libraries, and applications. 


This paper first contrasts our extended view of 
interposition to other profiling, tracing, and interpos- 
ing techniques. This is followed by a description and 
sample output of tools developed around this view; a 
discussion of obstacles encountered developing the 
tools; and finally, a discussion of anticipated and unan- 
ticipated ways those tools have been applied. 


1. Motivation 


The tools described in this paper were created 
to analyze performance of graphics applications. The 
application writer seldom has access to the graphics
library source or profiled versions of the graphics li- 
braries. The library and hardware provider seldom has 
access to application source and data files. Our goal 
was to get useful performance data without special re- 
quests placed on either the application or libraries. 
We were also after more information than is
typically available. While traditional profile tools 
might tell you how many times a line drawing function 
is called, they don’t tell you what percentage of the 
lines are write-only versus read-modify-write opera- 
tions; what the average length, width, and angle of the 
lines are; and what line styles are used. One can envision
similar questions for a database, such as percentages
of read versus write transactions, whether access
patterns were sequential or random, etc. This additional
information significantly improves the ability to
perform more detailed analysis and make more in- 
formed decisions. 


Graphics libraries remain our group’s primary 
interest but the tools are generic to any dynamic library 
and have been applied both internally and externally to 
profile, trace, and generally interpose on non-graphics 
libraries. Additionally, the data that can be collected 
has proven useful well beyond performance analysis. 


2. Terminology 


For detailed discussions on dynamic linking, 
the reader should refer to [1,2,3]. The techniques de- 
scribed in this paper presume the applications which 
are to be profiled and/or traced have already made the 
decision to use dynamically linked libraries and that 
the run-time linker/loader provides a means to perform 
interposition. System V Release 4 (SVr4) UNIX and all
versions of Solaris provide that means through an 
identical interface. This technology has been around 
for several years, is easily used, and commonly found 
in operating systems. A brief description of the terms 
used by this paper should quickly resolve various op- 
erating system terminology issues. 

A dynamic library consists of a set of variables
and functions which are compiled and linked together
with the assumption that they will be shared by multiple
processes simultaneously and not redundantly copied
into each application. Because of this sharing,
dynamic libraries are sometimes called shared libraries
or shared objects. Compiler and linker flags assure
the program sections and data sections are cleanly separated
and the program sections are reentrant and reusable.
The compile and link of the application can leave
the dynamic library symbols (variable and function addresses)
unresolved until run-time (or time of execution).
This complicates the loader (the program which
initiates an application) since the linking (resolution of
all symbols) must be completed on execution rather
than at compile/link time.

The process of placing a new or different li- 
brary function between the application and its refer- 
ence to a library function is called interposing. We 
specifically avoid placing any constraint on what the 
interposing function must do other than accept the pa- 
rameters of the real function and return an appropriate 
value back to the application. The real function may or 
may not be called and new side effects may or may not 
occur due to the interposing function. 


3. Profiling and Tracing Techniques 


We have developed a set of tools under the um- 
brella name SLI (pronounced sly) which is an acronym 
for Shared Library Interposer. SLI contains programs 
and utilities that enable application and library devel- 
opers to monitor and analyze calls to shared library 
functions. SLI is intended to augment, not replace, 
analysis tools such as tcov [4,5,6,7], gprof [4,5,7], and 
analyzer [10]. This section provides an overview of 
our technique and contrasts it to other techniques. 


3.1. Overview of SLI’s Technique 


[Figure: a dynamically linked user application and the SLI version of a shared library]


What distinguishes SLI from traditional analysis
tools is how it collects information and what information
it collects. The loader waits until execution of
an application to resolve the shared library functions 
called by the application. SLI has the loader resolve 
the addresses to a wrapper of the library function, col- 
lects information in the wrapper, and calls the real li- 
brary function from the wrapper. Several advantages 
arise from this technique: 


• An accurate trace of call sequences can be logged.

• Parameter values are available to be logged and/or
altered both before and after the real function call.

• The real function can be replaced if desired.

• Any subset of the library functions can be profiled
instead of all functions in a library.

• Nesting levels into the libraries can be controlled.

• Different levels of profiling can be enabled or disabled
while the application is running.

• Multiple processes and multiple library statistics can
be logged to single or multiple locations.

• Profiling is available without application or library
source and without requiring any specially compiled
or linked objects. The only requirement is
that the application must be dynamically linked to
the shared libraries of interest.


Some of these advantages can be found in other 
tools or other tools can be altered to produce similar re- 
sults, but SLI pulls it all together in an easily maintain- 
able, dynamically controllable, and customizable 
package which remains independent of the application 
and library sources. 
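
The wrapper idea can be illustrated in miniature. The sketch below is a Python analogy only: SLI itself relies on the run-time linker to substitute wrappers for C library functions, whereas here a dictionary stands in for the library's symbol table. The names `interpose`, `library`, and `draw_line` are hypothetical.

```python
import time


def interpose(namespace, name, log):
    """Replace namespace[name] with a wrapper that records each call and
    then invokes the real function unchanged."""
    real = namespace[name]

    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = real(*args, **kwargs)             # call the real function
        elapsed = time.perf_counter() - start
        log.append((name, args, result, elapsed))  # parameters and return value
        return result

    namespace[name] = wrapper
    return real


# A stand-in "library": a dict mapping symbol names to functions.
library = {"draw_line": lambda x0, y0, x1, y1: abs(x1 - x0) + abs(y1 - y0)}
trace = []
interpose(library, "draw_line", trace)

value = library["draw_line"](0, 0, 3, 4)   # the caller is unchanged
print(value, trace[0][:3])
```

Neither the caller nor the wrapped function is modified, which is the property the interposition technique depends on.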


3.2. Comparison To Other Techniques 


The trace command [4,5] of BSD UNIX and 
truss command [6,7] of SVr4 UNIX demonstrate some 
desirable features. These commands run an application 
or attach to an active process providing a trace of the 
system calls made by the application, showing the pa- 
rameter values and return values or a summary count 
of all system calls and total time spent in each. Truss 
also allows restrictions on which calls are reported. No 
special flags are required when the application is com- 
piled. With the exception of attaching to an active pro- 
cess, interposition of dynamic libraries allows all of 
these features to be applied to user level libraries. Ad- 
ditionally, our tools allow programmatic and interac- 
tive control of the data collection and multiple process 
data collection to a single file or per-process separate 
files for custom postprocessing reports. For some li- 
braries, such as graphics or database libraries, all up- 
dates must pass through the library so parameter and 
return value capturing is sufficient to record and replay 
an application run. This has many positive ramifications
on the project lifecycle as discussed in section 6.


System call interposition can significantly ex- 
pand operating system functionality and transparently 
provide a number of new services to applications. The 
COLA [17] and nDFS [18] projects are examples of 
expanding file system functionality through system 
call interposition. The Interposition Agents Toolkit 
[16] presents a number of clever examples of system 
call interposition utilities and provides an environment 
to easily create new utilities. 


Trace, Truss, and Interposition Agents are im- 
plemented through an operating system trap mecha- 
nism for system calls. The trap mechanism allows both 
dynamically and statically linked applications to bene- 
fit from the utilities and allows attaching to an active 
process. However, the trap mechanism incurs a heavi- 
er overhead and is not available for user level library 
functions. Conversely, interposing on dynamic libraries
does not work for statically linked applications and
must be selected prior to the application execution but
introduces less overhead and allows more libraries to
be interposed. Interestingly, the end result of any of
these tools and utilities is independent of the interposing
technique, so informed decisions can be made in
the selection of a technique.


Unfortunately, most tools for profiling and 
tracing require that the source code of the application 
and libraries be compiled and linked with options dif- 
ferent from those of the release executable. That alone 
can change the executable enough to skew results sig- 
nificantly. The prof [4,5,6,7] command of UNIX and 
its variants gprof and lprof are the most common profile
report generators in the UNIX environment. Special
compilation flags cause code to be added to each
function to maintain counts. When the application is
run, it is interrupted at regular intervals and information
about the currently active function is collected.
This information, and counts of all function entries, are 
written to a file upon application completion. This 
technique has extremely low overhead but lacks de- 
tailed accuracy, is limited to the functions that were 
specially compiled, and only allows a single applica- 
tion per data file. Interposing on dynamic libraries 
overcomes these restrictions but only for library func- 
tions, not the functions of the application. The time in- 
terval between library calls can be monitored, giving 
some measure of application time versus library time, 
but no detailed profile of the application functions is 
collected. It is possible to add calls to our toolkit li- 
brary directly in the application or library source or ob- 
ject but that defeats the concept of interposition to 
avoid altering application source code and linking. 
More onerous than the potentially skewed results
of special compilation is the requirement for
source code, or at least specially compiled but un- 
linked object files. Operating system vendors often 
provide multiple versions of a library: an optimized 
dynamic version; an optimized static version; a profile 
static version; and internally there may be debug dy- 
namic and static versions. The application writer then 
compiles and links appropriately. Debuggers such as 
adb [4,5,6,7], dbx [4,5,7], and sdb [6] require the source
files be present to exploit their full power. Special in- 
terposing functions generally require relinking the ap- 
plication with new versions of the functions. It is 
extremely common to see alternate versions of the libc 
memory allocation routines malloc/free [4,5,6,7,8,9]. 
The precedence of linking allows interposition to occur
either at compile/link time or run-time. Linkers resolve
references on a first-come, first-served basis. If an
application is linked with two libraries containing 
functions with the same name, the functions of the first 
library scanned are used. At compilation/link time, the 
order of the libraries listed on the link command specifies
the precedence. At run time, there are a number of
environment variables that can alter the order and list 
of libraries scanned. See section 4.1 for our choice of 
technique. Some tools such as Purify, Quantify [9], 
and Sentinel [8] may alter the code contained in the ap- 
plication and libraries or link in new functions not pre- 
viously included in the application. Relinking requires 
a new version to replace the function of the original li- 
brary. The default behavior of our tool is to have a 
wrapper call the actual function which the application
would have used, maintaining precise timing measure- 
ment and accuracy with the results of the function call. 
However, there is no requirement that our wrapper has 
to call the real routine. The same linker tricks we em- 
ploy can be used to totally replace the real function or 
augment with new functionality as in [16,17,18]. More
often, we still call the real function but output addition- 
al data such as hardware simulation streams. 


The granularity of our library tracing technique 
is limited to the function level. Some profilers such as 
tcov and Quantify track hot spots of source code within 
functions. For a theoretical treatment of hot spot pro- 
filing and tracing with source code, see [14]. We con- 
sider the function level granularity quite sufficient for 
our needs, especially when combined with parameter 
and return value tracing. Lack of access to source code 
was considered part of our constraints. 


4. The SLI Toolset 


Tools are provided for varying levels of expertise.
Our primary customer is interested in the same libraries
that our group is and utilizes the interposing
libraries we’ve already built and the report generators
we’ve already written. The second tier user needs
more or different information than we are providing by 
default and modifies the interposing library source or 
the postprocessing report generators to their own 
needs. Third tier users want to interpose on an entirely 
new library and so must create their own interposing 
library from scratch. SLI includes a collection of 
source, binaries, and awk and perl scripts to assist and
serve as examples to help each level of user. 


4.1. SLI Data Collection 


Data is optionally collected to three locations. 
First, cumulative information is kept in shared memo- 
ry. This cumulative information includes a count of 
function invocations, how much time was spent in the 
function, and how much time was spent in the function 
plus any descendants it may have invoked. Second, 
data can be written to standard-out or standard-error 
(stdout, stderr) providing trace and parameter informa- 
tion similar to the output of truss or customized output. 
Third, trace data can be collected to disk including the 
process-id (pid), the library, the function, the nesting 
level, how much time has elapsed since the last call 
into the library, how much time the call took, how 
much time SLI added in overhead, and parameter val- 
ues and/or other interesting data. Multiple libraries 
from multiple applications can write to the same file or 
each process can create its own file. 
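
The cumulative bookkeeping described above (call count, time in the function, and time in the function plus descendants) can be sketched with a simple call stack. This is an illustrative model, not SLI's shared-memory layout; the class name and the injected clock are assumptions made so that the example is deterministic.

```python
class Cumulative:
    """Per-function call count, self time, and time including descendants."""

    def __init__(self, clock):
        self.clock = clock
        self.stats = {}    # name -> [calls, self_time, inclusive_time]
        self.stack = []    # entries: [name, entry_time, time_in_children]

    def enter(self, name):
        self.stack.append([name, self.clock(), 0])

    def leave(self):
        name, entered, child_time = self.stack.pop()
        inclusive = self.clock() - entered
        entry = self.stats.setdefault(name, [0, 0, 0])
        entry[0] += 1
        entry[1] += inclusive - child_time     # time in the function itself
        entry[2] += inclusive                  # function plus descendants
        if self.stack:
            self.stack[-1][2] += inclusive     # charge the caller's child time


# Deterministic fake clock: calloc is entered at 0, calls malloc from 10 to 30,
# and returns at 50 (arbitrary time units).
ticks = iter([0, 10, 30, 50])
acct = Cumulative(lambda: next(ticks))
acct.enter("calloc")
acct.enter("malloc")
acct.leave()
acct.leave()
print(acct.stats)
```

A wrapper would call `enter` before invoking the real function and `leave` after it returns.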


The data collection can be controlled program- 
matically or through a terminal command line inter- 
face or through a graphical user interface (GUI). While 
we consider it desirable to be able to control data col- 
lection through scripts and without requiring the win- 
dow system to be running, experience has shown 
100% of the user base uses the GUI ignoring the other 
two methods. Perhaps this is a skewed sampling since 
our primary customers are graphics library users. 

Data control includes clearing the cumulative 
information; starting and stopping the stderr output; 
and starting collection to disk through appending to
existing data or rewinding/truncating the file and starting
anew. Additionally, data reduction can be controlled
to reduce disk overhead; per-library flags can
be controlled to alter wrapper functionality on the fly; 
and inner library nesting call tracing can be controlled 
(e.g. if the application calls a function in the library 
and that library function in turn calls another function 
in the same library, does the user want to capture that 
“inner library” call or only see what was directly in- 
voked via the application). The figure below is a snap- 
shot of the GUI which controls data collection. 


The ldd [5,6,7] command lists the dynamic libraries
used by a program. The list is generated at
compile/link time but the path to locate the libraries
can be altered at run-time. Following is an example of
the ldd output of the xterm program (a terminal emulator
program in the X-windows environment):


% ldd xterm
        libXaw.so.5 => /usr/openwin/lib/libXaw.so.5
        libXmu.so.4 => /usr/openwin/lib/libXmu.so.4
        libXt.so.4 => /usr/openwin/lib/libXt.so.4
        libX11.so.4 => /usr/openwin/lib/libX11.so.4
        libdl.so.1 => /usr/lib/libdl.so.1
        libc.so.1 => /usr/lib/libc.so.1

Each of these libraries can have all or any subset
of their functions wrapped with data-collecting interposing
functions. Furthermore, since xterm is an X-
windows client application, it is possible to simulta- 
neously profile the X11 server process and correlate 
interactions between the client and server. 


The wrapper functions are invoked through the 
use of the LD_PRELOAD environment variable of the 
loader. The following example shows how the ldd output
for xterm is changed by this environment variable.

% LD_PRELOAD="./libX11.api.so ./libsli.so"
% export LD_PRELOAD
% ldd xterm
        ./libX11.api.so => ./libX11.api.so
        ./libsli.so => ./libsli.so
        libXaw.so.5 => /usr/openwin/lib/libXaw.so.5
        libXmu.so.4 => /usr/openwin/lib/libXmu.so.4
        libXt.so.4 => /usr/openwin/lib/libXt.so.4
        libX11.so.4 => /usr/openwin/lib/libX11.so.4
        libdl.so.1 => /usr/lib/libdl.so.1
        libc.so.1 => /usr/lib/libc.so.1


You will note that there are now two versions
of libX11 listed, our wrapper version and the system
version. We’ve added api to the name because we have
three interposing versions of libX11. One version contains
the Application Programmer Interface (API)
functions where we monitor only those functions the
application programmer has access to. A second version
includes all functions contained in the libX11
source. A third version interesting to our group contains
all the X11 functions with a graphics context parameter.
Every symbol we’ve included in our wrapper
library is resolved to our wrapper and every symbol we
haven’t included is resolved through the normal path. The
additional library libsli.so contains SLI wrapper support
functions.

[Figure: snapshot of the SLI data-collection GUI, showing library libX11.api and file /usr/tmp/SLI_FILE]
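
The resolution rule at work here, where the first library scanned that defines a symbol wins, can be modeled in a few lines. The sketch is a simplified model of the search order, not the loader's actual algorithm, and the symbol sets assigned to each library are hypothetical.

```python
def resolve(symbol, search_order, exports):
    """Return the first library in search_order that defines symbol."""
    for lib in search_order:
        if symbol in exports[lib]:
            return lib
    raise LookupError(symbol)


# Hypothetical export lists: the wrapper library covers only a subset.
exports = {
    "./libX11.api.so": {"XDrawLine", "XFlush"},
    "libX11.so.4": {"XDrawLine", "XFlush", "XrmGetResource"},
}
# Preloaded libraries are scanned before the application's own libraries.
order = ["./libX11.api.so", "libX11.so.4"]

wrapped = resolve("XDrawLine", order, exports)
passthrough = resolve("XrmGetResource", order, exports)
print(wrapped, passthrough)
```

Symbols that the wrapper library defines resolve to it, and everything else falls through to the system library, which is why any subset of functions can be wrapped.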


4.2. SLI Reports 


Default reports are provided on the cumulative 
data and the data written to disk. While data collection 
is generic, interesting reports on the collected data can 
be very library specific. Once the data has been collected,
it is often necessary to write a custom postprocessing
script in sed, awk [4,5,6,7], or perl [15] to
glean the interesting information. We provide some 
postprocessor filters for our graphics libraries which 
also serve as examples. 


The figure above shows a graph of the cumula- 
tive information collected on the start-up of an xterm 
before any character has been typed. 


From this we note that 2578 calls were made 
into libX11 of which 500 calls (19%) were Xpermal- 
loc. However, the function called the most doesn’t 
necessarily consume the most time, as demonstrated in 
the figure below. 


The times are displayed in microseconds. This 
tells us it took slightly over 1.5 seconds to get the 
xterm started. Even though Xpermalloc was the most 
frequently called function, it is not one of the top 9 
consumers of total time. 
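
The arithmetic behind these observations can be reproduced from the figures quoted in the text. The sketch below is a small worked example; the helper name `share` is illustrative, and only three of the listed functions are included for brevity.

```python
def share(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


total_calls = 2578
xpermalloc_calls = 500

total_us = 1565996                    # the "=Total=" entry, in microseconds
times_us = {                          # three of the nine listed functions
    "XEventsQueued": 308069,
    "XLoadQueryFont": 273653,
    "XSync": 196911,
}

# Rank by time, largest first, as the time-sorted graph does.
ranked = sorted(times_us.items(), key=lambda kv: kv[1], reverse=True)

print("Xpermalloc share of calls:", share(xpermalloc_calls, total_calls), "%")
print("start-up time:", total_us / 1e6, "seconds")
for name, t in ranked:
    print(f"{name:16s} {t:8d} {share(t, total_us):5.1f}%")
```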


[Figure: SLI Graph window listing cumulative time per function, in microseconds]


1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA 





1565996  =Total=
 308069  XEventsQueued
 273653  XLoadQueryFont
 196911  XSync
 176509  XFlush
 105313  XPending
  83899  XrmGetResource
  81209  XInternatom
  50325  XFillRectangle
  49169  XOpenDisplay




[Figure: SLI Graph Print dialog with the fields Pipe To (pageview -Ws 614 830 -dpi 72 -), Show Top, and Format (PostScript or ASCII)]





The graph can be updated on regular intervals 
while the program is running, monitoring changes dur- 
ing execution. The graph could also be collecting the 
libX11 calls for all active applications in addition to 
the single xterm. 


Two other sorts are provided by default. The 
time a function took plus all of the interposed descen- 
dants it invoked is sometimes more illuminating than 
the overhead of just the function itself. Second, you 
may want to find the call frequency or time for a spe- 
cific function, and the sort-by-name simplifies finding 
the function. 


The cumulative information can be processed 
through the print option. The default destination is a 
postscript preview program, but any command can be 
given to direct the output to a file, printer, or filter pro- 
gram. Generally, the list of interesting functions falls 
off very fast, so a threshold can be set on how many 
functions are reported. The format can be in ASCII,
allowing postprocessing by custom filter programs.


The data collected to disk is kept in binary 
form to attempt to reduce the size of the files. A program
called sli_interp is provided to convert the data
to ASCII. A postprocessing program can then take that
ASCII stream and generate interesting reports. The figure
below shows some sample output from sli_interp.


The first column provides feedback for nesting 
information. If no function calls are nested, all of the 
information associated with a function is kept to one 
line starting with a vertical bar. If other function calls 
are made within a traced function, an open brace denotes
the entry of a function and a close brace denotes
exit from the function. A dashed field means this information
cannot be provided until the function exits, or
it has already been provided on the function entry. This
example starts with the application making a call to
xgl_inquire. The xgl_inquire function, in turn, makes
several libc calls. We also see that calloc makes two
other libc calls, malloc and bzero. The PID column
traces which process made the call. The Library column
shows the library, and the Function column
shows which function of that library. Nest shows detailed
scope for which functions called which. Appl
shows how much time passed since the last application
call into this library. Elapsed shows how much time
was spent on this function call. The SLI column shows
how much time overhead SLI introduced to collect this
information. Data optionally contains parameter or
other interesting data values before and after the function
was invoked.

[Figure: sample sli_interp output with the columns described above, ending in Nest, Appl, Elapsed, SLI, and Data]

[Figure: sample report from a library-specific postprocessor]

Function                 Blocks
xgl_multimarker               0
xgl_multipolyline          3982
xgl_multi_simple_pol          0
xgl_polygon                 607
xgl_triangle_strip           53
xgl_quadrilateral_me          0
xgl_stroke_text               0
xgl_annotation_text           0

Calls (row alignment lost): 0, 6121, 607, 673, 0, 0, 0

Totals: Markers; Lines; Chars; Triangles;
Calls to xgl_context_post(); Calls to xgl_context_new_frame();
Calls to xgl_object_get()

Timing: appl+func+sli time; sli time; appl+func time;
appl time; func time


All of these fields can optionally be omitted 
from the data collection in order to do up-front data re- 
duction. If the user knows only one process is being 
profiled or only one library is being traced, then the 
user can select not to save those fields in the binary 
file. The sli_interp program knows how to handle the 
reduced data files. 


Library-specific postprocessors can produce
useful reports from the sli_interp output. For example,
the report in the figure above is a summary of what
graphics primitives were used during an application
run and how well the application merged multiple primi-
tives into a single library call; it also breaks out the per-
centages of time spent in the application versus the
library versus the overhead introduced by SLI. In this
example, the entire run took just over 2 minutes (appl 
+ func + sli time = 134.20 seconds). Only 14% of the 
time was spent in the application and 81% of the time 
was spent in rendering the graphics. The overhead in- 
troduced by SLI only represented 5% of the total time 
(6.14 seconds). It is fairly obvious that the shared 
memory data produces the lowest overhead, but not so 
obvious that the binary data collected to disk is much 
faster than the formatted ASCII output sent to stderr. 
SLI memory maps the file and accesses it as if it were 
memory, leaving it up to the system to write it back to 
disk when necessary. Contrast that to a formatted print 
statement being output to a scrolling terminal and the 
reason for the different overheads becomes more clear. 
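
The memory-mapped collection path described above can be sketched roughly as follows. This is a minimal illustration and not SLI's actual code: the record layout (trace_rec), the trace_open/trace_append/trace_close names, and the fixed-capacity file are all assumptions made for the example. Only the POSIX calls open, ftruncate, mmap, munmap, and close are real interfaces.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fixed-size binary trace record (not SLI's real format). */
typedef struct {
    uint32_t lib_id;      /* numeric library id */
    uint32_t func_id;     /* numeric function id within that library */
    uint64_t elapsed_ns;  /* time spent in the real function */
} trace_rec;

typedef struct {
    trace_rec *recs;  /* mapped file contents */
    size_t     cap;   /* number of records the file can hold */
    size_t     used;  /* number of records written so far */
    int        fd;
} trace_log;

/* Create a file large enough for cap records and map it read/write.
 * Returns 0 on success, -1 on failure. */
int trace_open(trace_log *log, const char *path, size_t cap)
{
    void *p;

    assert(log != NULL && cap > 0);
    log->fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (log->fd < 0)
        return -1;
    if (ftruncate(log->fd, (off_t)(cap * sizeof(trace_rec))) != 0) {
        close(log->fd);
        return -1;
    }
    p = mmap(NULL, cap * sizeof(trace_rec), PROT_READ | PROT_WRITE,
             MAP_SHARED, log->fd, 0);
    if (p == MAP_FAILED) {
        close(log->fd);
        return -1;
    }
    log->recs = p;
    log->cap = cap;
    log->used = 0;
    return 0;
}

/* Appending a record is a plain memory store; the kernel writes the
 * dirty pages back to the file on its own schedule. */
int trace_append(trace_log *log, trace_rec rec)
{
    if (log->used == log->cap)
        return -1;  /* full: a real tool would have to grow the file */
    log->recs[log->used++] = rec;
    return 0;
}

int trace_close(trace_log *log)
{
    int rc = munmap(log->recs, log->cap * sizeof(trace_rec));
    close(log->fd);
    return rc;
}
```

Compared with a formatted print per call, the per-record cost here is a single structure copy, which is consistent with the lower overhead the text reports for binary collection.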
4.3. Interposing Library Source


The second and third tier users want to go be- 
yond the default reports and libraries that our group 
has provided. This implies they need to alter the inter- 
posing libraries we provide or create new interposing 
libraries. We provide the source to each interposing li-
brary we’ve already created and we also provide tools
to automate the process of creating a new interposing
library. There are several steps to creating a new inter- 
posing library that have taken us as little as 20 minutes 
for one library and as long as two weeks for a particu- 
larly difficult library whose default output format just 
wasn’t what we wanted to report to our users. On
average, most of our customers have been able to get a
useful interposing library in one half-day of effort. 


The first step is to create what we call a proto- 
type file. This is a file that consists of the function dec- 
larations for all functions to be traced. Generation of 
this file is generally quite easy. For an Application 
Programmer’s Interface library, the declarations are 
already in a header file. Lint library declarations can 
serve as a source as well. If C program source is avail- 
able (either K&R-C [11] or ANSI-C [12]), then the 
cproto program [13] quickly and easily generates the 
prototype file. Using cproto has been our primary 
method. C++ turns out to be quite difficult to generate 
interposing libraries for and libc provides some special 
challenges. Both of these are discussed in more detail 
in the “Obstacles” section. 

Since libc is quite common to many operating 
systems, we will use a small example of calloc and fol- 
low it through to an interposing library. The prototype 
file would contain: 


    void *calloc(size_t num, size_t size);

We allow single-line comments and preproces-
sor directives to be placed in the prototype file as well.
These are passed directly from the prototype file to the 
generated C code. The prototype file is processed by 
an awk script that generates two files. One file is a 
“translation file” that tracks the total number of func- 
tions in the library, the length of the longest function 
name, and a translation table from a numeric assign- 
ment to the ASCII name of the function. Since this in- 
formation is static once the library is generated, it aids 
the dynamic allocation of arrays during the data collec-
tion phase of profiling. The number-to-name mapping
allows compact information to be written to the data
file in binary. The second file from the awk script is the
C source for the interposing library. We call this a 
working wrapper template. We use the term working 
because the generated code can compile and be useful 
right away, but we add the term template because truly 
interesting, detailed data collection generally requires 
customization of the generated code. Knowing the 
name and size of parameters is useful, but contextually 
understanding the contents of a complex structure and 
what’s interesting generally requires human interven- 
tion and customization. The generated C code for the 
calloc prototype is: 


    void *calloc(size_t num, size_t size)
    {
        static char *func_name = "calloc";
        typedef void *(*real_func_type)
            (size_t num, size_t size);
        static real_func_type real_func;
        void *return_value;
        int save_sli_active = sli_active;

        SLI_DECLARE

        if (sli_active)
        {
            if (!real_func)
                real_func = (real_func_type)
                    (*sli_resolve(3, func_name));
            return((*real_func)(num, size));
        }

        sli_mark(SLI_MARK_SLI_ENTER);
        sli_active = 1;
        if (!sli_lib_info_3)
            sli_lib_info_3 = sli_find[...](3);
        if (sli_lib_info_3->tra_ctl == SLI_TRA)
            fprintf(SLI_STDOUT,
                "calloc( num=0x%0x size=0x%0x) \n",
                num, size);
        if (!real_func)
            real_func = (real_func_type)
                (*sli_resolve(3, func_name));
        SLI_PROLOG
        sli_send(SLI_ENTER, 3, 15, SLI_EOP);
        sli_active = 0;
        sli_mark(SLI_MARK_FUNC_ENTER);
        return_value = (*real_func)(num, size);
        sli_mark(SLI_MARK_FUNC_EXIT);
        sli_active = 1;
        sli_send(SLI_EXIT, 3, 15, SLI_EOP);
        SLI_EPILOG
        sli_active = save_sli_active;
        sli_mark(SLI_MARK_SLI_EXIT);
        return return_value;
    }



This example serves to illustrate several points. 
First and foremost, strong typing must be followed for 
the return types and parameter types. Different com- 
pilers have different rules for data type sizes, calling 
conventions, and parameter promotion rules. The tem- 
plate is careful to cast all types. This template is for an 
ANSI-C compiler. The same awk script knows how to 
generate output for K&R-C and C++ with minor vari- 
ations. The function name is placed in a variable per 
function so it can be symbolically referenced in the 
SLI_DECLARE, SLI_PROLOG, and SLI_EPILOG
macros. These macros are null by default but provide 
a means for all functions to have common code added 
with ease. 


You will see the constants 3 and 15 throughout 
the template. Both of these constants are the number- 
to-name mapping values generated in the translation 
file. The 3 is for libc and the 15 is for calloc in libc.
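
To make the number-to-name mapping concrete, here is a small sketch of what a generated translation table might look like once expressed in C. The structure and function names (name_entry, func_name_of, longest_name_len) are hypothetical, and only the calloc entry (15) comes from the text; the malloc number is invented purely to give the table a second row.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of a hypothetical per-library translation table. */
typedef struct {
    int         func_id;  /* number written to the binary data file */
    const char *name;     /* ASCII name restored when the data is read back */
} name_entry;

static const name_entry libc_names[] = {
    { 15, "calloc" },  /* number taken from the text */
    { 16, "malloc" },  /* assumed number, for illustration only */
};

static const size_t libc_name_count =
    sizeof libc_names / sizeof libc_names[0];

/* Map a numeric id back to its function name, or NULL if unknown. */
const char *func_name_of(int func_id)
{
    size_t i;

    for (i = 0; i < libc_name_count; i++)
        if (libc_names[i].func_id == func_id)
            return libc_names[i].name;
    return NULL;
}

/* Length of the longest name, which a report formatter could use to
 * size its output columns. */
size_t longest_name_len(void)
{
    size_t i, len, best = 0;

    for (i = 0; i < libc_name_count; i++) {
        len = strlen(libc_names[i].name);
        if (len > best)
            best = len;
    }
    return best;
}
```

Because the table is fixed when the library is generated, a trace record only needs to carry the two small integers rather than the strings.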

Since this function is in libc, there is some add- 
ed code that is not typically found in the templates. 
The initial check for the global variable sli_active is a 
hook to avoid recursing on libc from our interposing 
code (that is to say, we want to trap libc calls from the 
application and the functions we are tracing but we 
don’t want to trap libc calls that our tracing software 
uses). A global variable lacks elegance but provided a 
quick solution to trace all libc functions. A better solu- 
tion will be required to support multiple threads. 


The sli_mark function is used to track the time 
durations of the overhead SLI has introduced and the 
duration of the real function when it is called. There 
are four sli_mark calls. First on entry to the wrapper, 
second just before the real function is called, third im- 
mediately upon return from the real function and 
fourth on exit from the wrapper. For non-libc wrap-
pers, sli_mark is the first executable statement.
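
The way four marks separate wrapper overhead from real function time can be illustrated with a little arithmetic. The sketch below is an assumption about how such marks could be combined, not a description of SLI's internals; the mark_set type and the two helper functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* The four timestamps taken in a wrapper, in the order described above. */
enum {
    MARK_SLI_ENTER,   /* entry to the wrapper */
    MARK_FUNC_ENTER,  /* just before the real function is called */
    MARK_FUNC_EXIT,   /* immediately after the real function returns */
    MARK_SLI_EXIT,    /* exit from the wrapper */
    MARK_COUNT
};

typedef struct {
    uint64_t t[MARK_COUNT];  /* timestamps in any monotonic unit */
} mark_set;

/* Time spent inside the real function. */
uint64_t func_time(const mark_set *m)
{
    assert(m->t[MARK_FUNC_EXIT] >= m->t[MARK_FUNC_ENTER]);
    return m->t[MARK_FUNC_EXIT] - m->t[MARK_FUNC_ENTER];
}

/* Time added by the wrapper: the stretch before the real call plus
 * the stretch after it. */
uint64_t sli_overhead(const mark_set *m)
{
    return (m->t[MARK_FUNC_ENTER] - m->t[MARK_SLI_ENTER]) +
           (m->t[MARK_SLI_EXIT] - m->t[MARK_FUNC_EXIT]);
}
```

For example, marks at 100, 130, 530, and 550 give 400 units in the real function and 50 units of wrapper overhead.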


The first time any function is called in the li- 
brary, some initial one-time overhead is incurred. The 
sli_lib_info is a shared memory page that is used for 
multiprocessing locking and run-time interactive con- 
trol of profile and trace functionality. Likewise, the 
first time a wrapper function is called, we have to find 
the pointer to the real function. This pointer is saved so 
the overhead is only encountered once per function. 


The tra_ctl structure member contains the cur- 
rent value for the “trace to stderr” option. This is the 
only data collection under complete control of the cus- 
tomizing user. The shared memory data collection and 
collection to the file is handled by the support library 
through the sli_send function. A variable list of param-
eters can be sent and stored as the data field in the bi-
nary file.

This is as much as can be automated without
contextual knowledge associated with the functions. 
For instance, the parameters are always just printed as 
a hex value. Often, a parameter might be a complex 
structure requiring human intervention to know what 
information in that structure is interesting and what 
format it should be printed in. Similarly, it would be 
inappropriate to simply save all parameter values to 
the trace file causing a tremendous growth in size. It is 
more appropriate not to save the parameters by default 
and allow human intervention to decide what informa- 
tion is important to save and in what format. 


5. Obstacles Encountered 


There are a number of issues that get in the way 
of implementing an interposing library. Fortunately, 
nearly all are solved, although some require fairly de- 
tailed system knowledge. 


5.1. Finding The Real Function 


This is potentially difficult to figure out for an 
arbitrary operating system. The Solaris 2.3 operating 
system provides a simple interface to accomplish this. 
The dlsym function is used to find the address of a
symbol in a dynamically linked library. Solaris 2.3
provides a special parameter to dlsym called
RTLD_NEXT which indicates to “find the next address of this
symbol in the list of libraries”. That’s all it takes. For
standard SVr4 and earlier versions of Solaris, dlsym is
available but does not support RTLD_NEXT. The so-
lution we followed is to traverse the linker structures
and locate the list of libraries through them. We then
dlopen each library in the list and loop through the list
with dlsym looking for the real function. This is not too
difficult and is somewhat documented [1] but should 
not be considered a normal, supported user interface. 
There are two concerns with this approach. One, be 
careful not to find your own symbol and get caught in 
a recursive loop. Two, for the sake of efficiency, you 
only want to loop through the libraries once to find the 
first function used in a library and keep that handle 
around for subsequent symbols from the same library. 
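
The RTLD_NEXT lookup described above can be sketched in a few lines. This is a minimal example under stated assumptions rather than an SLI wrapper: it interposes on rand purely because that function is simple, the rand_calls counter is hypothetical, and the _GNU_SOURCE define is only needed where the C library hides RTLD_NEXT behind a feature macro (glibc does; Solaris does not).

```c
#define _GNU_SOURCE  /* exposes RTLD_NEXT on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <stdlib.h>

/* Illustrative counter; a real wrapper would record much more. */
static unsigned long rand_calls;

/* Interposed rand(): count the call, then forward to the next
 * definition of the symbol in library search order. */
int rand(void)
{
    typedef int (*real_func_type)(void);
    static real_func_type real_func;

    if (!real_func) {
        /* Resolved once and cached, so the lookup cost is paid only
         * on the first call. RTLD_NEXT skips this definition, which
         * avoids finding our own symbol and recursing. */
        real_func = (real_func_type)dlsym(RTLD_NEXT, "rand");
        assert(real_func != NULL);
    }
    rand_calls++;
    return (*real_func)();
}
```

Built as a shared object and named in LD_PRELOAD, a function like this sits in front of the library's own definition; some systems also require linking with -ldl for dlsym.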


It is not uncommon for compilers to slightly al- 
ter the names of functions from the source to the object 
file. This usually takes the form of an underscore char- 
acter placed at the front and/or back of the name. The 
dlsym function properly handles the underscore for the
programmer. The C++ language adds considerably 
more information including the class and parameter 
type information in the object file function name. This 
presents a problem in specifying the name of the real 
function to dlsym. For C++ compilers that preprocess 
the C++ code into C and then use the C compiler, you
can collect the “mangled” name from the C code. For
C++ compilers that compile directly into object files,
you may need to use the nm [4,5,6,7] command to de-
termine the mangled names.


5.2. C++ 


There are a number of interesting problems 
that arise when attempting to interpose on C++. As just 
mentioned above, finding the real function with the 
name mangling schemes is one hurdle. Generating the 
list of function prototypes can be much more complex 
than with C. Interposing on a programmer’s interface 
is still straightforward, but finding all of the functions 
in a C++ library can be quite difficult. A proper profile 
and trace of a library needs to know where all of the 
overhead comes from. Scanning the source is insuffi- 
cient since C++ may generate a lot of functions for the 
programmer. Examples of generated functions are 
constructors, destructors, copy operators, function 
templates, and virtual functions. The solution appears 
to be to query the library object files for functions and 
reverse that back to function prototype declarations. 
However, this can still be missing critical information 
such as default parameter values and full type informa- 
tion and class member function declarations for com- 
piler created functions. At this time, C++ still remains 
a challenge which we’ve only partially solved. 


5.3. Interposing On libc 


Our support library is written in C and uses
many functions from libc. Furthermore, most of the li-
braries we interpose on are also written in C and make
use of libc. When we finally decided to add libc to the 
list of supported libraries, we found ourselves with re- 
cursive looping problems. The COLA project [17] also 
uses LD_PRELOAD to interpose on system calls and 
reports a similar looping problem. 


Our solution required two steps. First, the in- 
terposing version of libc checks a global variable to 
know if it is being called from an interposing or SLI in- 
ternal function rather than an application or regular li- 
brary function. If so, it calls the real function directly 
without collecting any statistics. As noted in section 
4.3, the use of a global variable will prove to be a prob- 
lem when we want to support multiple threads. This is 
our only global variable and might be fixed by making 
it a thread specific variable. Second, the routine to find 
the real function had to be made “libc clean”. That is
to say, it couldn’t have any references to libc in that 
one function or it had to precisely resolve any libc 
functions it did use directly to the real libc library. 
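
The recursion guard, and the thread-specific variant suggested above, can be sketched as follows. This is an illustration only: it uses the C11 _Thread_local storage class, which postdates the original work, and every name in it (in_tracer, wrapped_op, real_op, tracer_record) is hypothetical. The real_op function stands in for the real library function, and tracer_record stands in for tracing code that itself calls the interposed function.

```c
#include <assert.h>

/* Per-thread replacement for a single global "active" flag. */
static _Thread_local int in_tracer;

static unsigned long traced_calls;    /* calls that were recorded */
static unsigned long untraced_calls;  /* calls passed straight through */

int wrapped_op(int x);

/* Stand-in for the real library function. */
static int real_op(int x)
{
    return x + 1;
}

/* Stand-in for tracing code that happens to use the interposed
 * function itself, as the SLI support library uses libc. */
static void tracer_record(int x)
{
    (void)wrapped_op(x);
}

int wrapped_op(int x)
{
    int saved = in_tracer;
    int result;

    if (in_tracer) {
        /* Called from tracing code: forward without collecting data. */
        untraced_calls++;
        return real_op(x);
    }

    in_tracer = 1;        /* the tracer's own calls must not be traced */
    traced_calls++;
    tracer_record(x);
    in_tracer = 0;        /* calls made by the real function are traced */
    result = real_op(x);
    in_tracer = saved;
    return result;
}
```

A single application call is recorded once, while the nested call made by the tracing code passes through untraced, which is the behavior the global variable provides in the single-threaded case.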

5.4. LD_PRELOAD Side Effects


LD_PRELOAD should be used with care. One 
side effect of this environment variable is that the in- 
terposing functions are loaded for all commands is- 
sued. You may end up collecting data from more 
processes than expected. Also, if the interposing func- 
tions reference symbols expected to be resolved by li- 
braries of the application, other commands might not 
have included those libraries, leaving those symbols 
unresolved causing command execution failure. This 
can be overcome by linking each interposing library to 
the library it interposes. 


5.5. Scoping Issues 


All functions in a library made available to an 
application programmer have to be declared global. 
However, it is possible that the library may have inter- 
nal support routines that it uses but does not expose to 
the application programmer. If a function is declared 
static, then it cannot be interposed. We generate mul-
tiple versions of interposing libraries for each target li-
brary. One consists exclusively of the functions 
available through the Application Programmer Inter- 
face (API); a second for all non-static routines in the 
source; and sometimes a third containing a specifically 
interesting subset of functions. 


Global variable references can be a problem if 
not considered carefully. Two functions within a li- 
brary may share access to a global variable. The scope 
of that global variable and whether or not the interpos- 
ing functions are interested in that variable raises some 
issues. If the variable is global to the entire library and 
application, then no problem exists. If the variable is 
shared between two functions in the same source file 
but not global to the library and application, then it 
may not be accessible. Similarly, we have encountered 
some compiler discrepancies on inner library calls. If 
two functions foo and bar are contained in the same 
source file, and compiled to the same object file, and 
foo calls bar, some linkers improperly resolve bar’s
reference in foo at link time rather than run-time which 
prevents interposition. This is actually a bug and if en- 
countered, can generally be overcome through compil- 
er or linker command line options. 


5.6. Parameter Handling 


It is reasonable to believe that a function found 
in a library is independent of the compiler it was gen- 
erated from, but in reality, problems such as parameter 
promotion and variable parameter list handling can 
present particularly difficult problems to isolate and 
resolve. 


A hard and fast rule to apply is: use the same 
compiler for your interposing function as the real li- 
brary. K&R-C compilers [11] have different parameter 
promotion rules from the ANSI-C standard compilers 
[12]. If the interposing function does not properly pass 
the parameters down to the real function or properly 
pass the return value back to the application, the inter- 
posing function is useless. 


Variable parameter list functions are an espe- 
cially interesting problem to solve generically since it 
is the responsibility of the called routine to determine 
how much information to read from the stack. The in- 
terposing function has the responsibility to pass the 
correct amount of information down to the real func- 
tion. We solved this two ways. One, for our automati- 
cally generated interposing functions that contain 
variable argument lists, we have some in-line assem- 
bly routines inserted that copy the entire frame of the 
calling routine into the stack for the real function. This 
can potentially copy too much information, generating 
unnecessary overhead, but guarantees the real function 
receives everything. The other solution is to customize 
the interposing routine to know how to parse the stack 
and pass down the correct amount of information. This 
too adds overhead since the stack must be both parsed 
and copied, but ensures only the necessary amount of
information is copied. 
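
For comparison, portable C offers a third route that is narrower than either approach above: when the real function has a va_list counterpart, the wrapper can forward its arguments through it without copying the stack frame. The sketch below uses that technique, not the in-line assembly the text describes, and it only applies to functions that have such a counterpart. The traced_snprintf name and its counter are hypothetical.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Illustrative counter of wrapper invocations. */
static unsigned long snprintf_calls;

/* Wrapper for a variadic function: capture the variable arguments in
 * a va_list and hand them to the va_list form of the real function. */
int traced_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    snprintf_calls++;
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n;
}
```

This avoids both over-copying and stack parsing, but it gives no help for a variadic function with no va_list form, which is the generic case the two solutions in the text address.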


5.7. Multiple Processes and Processors, 
Threads, and Network Implications 


Data collection and interpretation is straight- 
forward if only one process is collecting data, but mul- 
tiple process data collection is too valuable to ignore. 
For example, setting LD_PRELOAD to include all 
graphics libraries before starting the window system 
allows capture of all frame buffer activity for every ap- 
plication. The three data collection points plus the cen- 
tral control area must maintain atomic transactions for 
updates. For multiple processes on multiple processors 
on a single system, this can be handled fairly easily 
through an atomic read-modify-write semaphore in 
shared memory. Multiple threads of the same process 
on multiple processors adds complications. The same 
process-id may have mixed library function entry and 
exit flows. A thread identification needs to be included 
with the process-id to sort out data flow and maintain 
nesting stacks. Multiple processes running on separate 
machines in a network are very difficult to synchronize 
and we are only starting to tackle that problem. 

Semaphore locking adds the potential for dead-
locks. We only lock when we are ready to update
shared information and then immediately free the lock.
This has only been a problem when the application be-
ing profiled is killed. Our solution is to have the re-
quest for a lock time out and clear the lock itself,
reporting that the data “may be corrupted”.
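
A lock request that times out and takes over the lock can be sketched with C11 atomics. This is a simplified illustration and not the SLI implementation: the sli_lock type and function names are hypothetical, the lock lives in ordinary memory rather than a shared memory segment, and the takeover policy is an assumption about what "clear the lock itself" might look like.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical lock word; in a real tool it would sit in shared memory. */
typedef struct {
    atomic_flag held;
} sli_lock;

static double now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Spin on an atomic test-and-set until the lock is free or the timeout
 * expires. Returns true on a clean acquire. Returns false if the wait
 * timed out, in which case the holder is presumed dead and the caller
 * takes the lock over; the protected data may then be corrupted.
 * Note: two waiters that time out together would both take over, so a
 * real implementation needs a stronger recovery rule. */
bool lock_acquire(sli_lock *l, double timeout_sec)
{
    double deadline = now_sec() + timeout_sec;

    while (atomic_flag_test_and_set(&l->held)) {
        if (now_sec() >= deadline)
            return false;  /* flag is still set, so we now own it */
    }
    return true;
}

void lock_release(sli_lock *l)
{
    atomic_flag_clear(&l->held);
}
```

A caller that receives false would go on to update the shared area but flag the collected data as possibly corrupted, mirroring the report described in the text.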


5.8. Timer Overhead 


The resolution of the timer can make a major 
difference in the usefulness of a profiler. Initially, we 
used the gettimeofday libc function, but found the 
overhead of a regular system call took on the order of 
50 milliseconds when we wanted resolution on the or- 
der of nanoseconds. Under Solaris 1, we wrote a de- 
vice driver to provide direct user reads of the system 
clock. Under Solaris 2, a new function gethrtime is 
provided. Both of these gave us around 2 microsecond 
clock resolution improving our accuracy considerably. 
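
The cost of reading a timer can be estimated by calling it back to back, which is a quick way to check whether a clock source is cheap enough for per-call profiling. The sketch below uses the POSIX clock_gettime call as a stand-in, since gethrtime is specific to Solaris; the helper names are hypothetical and the result will vary by system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Average cost, in nanoseconds, of one timer read, measured over
 * iters back-to-back reads. */
double timer_overhead_ns(unsigned iters)
{
    uint64_t start, end;
    unsigned i;

    assert(iters > 0);
    start = now_ns();
    for (i = 0; i < iters; i++)
        (void)now_ns();
    end = now_ns();
    return (double)(end - start) / iters;
}
```

If the figure returned is comparable to the duration of the functions being profiled, the timer itself will dominate the measurements.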


6. Application of the Tools 


The tools have proven quite successful in 
quickly isolating performance bottlenecks in the use of 
graphics libraries and have provided the expected 
feedback to both the application writer and the library 
writer. What was unanticipated was the amount of in- 
formation we could gather and how that information 
could be applied. 


First, once the tools were in place, we found 
adding new libraries to be trivial (with the single ex- 
ception of libc). In one case, the time between the re- 
quest for a new interposing library and the time it was 
ready was only twenty minutes. In general, we are now 
surprised if it takes us more than two days to overcome 
any difficulties in creating a new interposing library. 
We originally anticipated looking at only a few librar- 
ies. The ease of adding new libraries has led to a quick 
proliferation of new libraries on demand and spread 
beyond graphics libraries. Sun customers who were 
shown the tool to assist in graphics performance have 
taken the initiative to create interposing wrappers for 
their own libraries. 

The application developer uses the default re- 
ports and the postprocessing reports to be able to make 
better use of the library. The hardware and library de- 
velopers get feedback on actual usage patterns in the 
library. That information can be applied in many ways. 
The library builder can sort functions that are often 
used together to provide cleaner paging. Application 
regression tests can show what primitives and at-
tributes are used and which aren’t (and thus are candidates
for eventual removal). Analysis of benchmarks, dem-
onstrations, and actual application usage emphasizes
what functions are critical and with what attributes or
parameter values, what functions are time consuming,
and what functions deserve the most attention and pos-
sible hardware acceleration.


The ability to capture all of the calls and pa- 
rameters has potential beyond simple playback. If an 
application records a session, encounters a bug, and 
that bug reproduces in the playback, then the odds are 
pretty good that the bug really is in the library or bad 
parameter values were passed to the library by the ap- 
plication. Either way, vendor support and bug report- 
ing can reproduce and analyze the bug without having 
to acquire the application, data, and instructions. Addi- 
tionally, the playback program (or even the wrapper) 
need not actually call the real library but can instead be 
used as a translator. The translator might emit simula- 
tion traces allowing developing hardware to test differ- 
ent schemes against real application data patterns. The 
translator might emit calls to an alternate or new ver- 
sion of the library testing the robustness and perfor- 
mance of the new library before the application has 
actually been ported. Furthermore, the playback is of- 
ten considerably faster than the original application 
since the computations leading to the function calls are 
already made. This means that bug tracing is much 
quicker and easier. Additionally, having the source to 
the playback program rather than the original applica- 
tion means special interposers like Purify can be ap- 
plied to the run by recompiling and linking the 
playback program rather than the application source. If 
the wrappers are compiled with debug flags, then a de- 
bugger can be used to provide functionality on library 
functions you wouldn’t normally have access to. An 
example is conditional breakpoints based on contextu- 
al parameter values to obtain a callback stack on a sys- 
tem call. 


7. Conclusions 


Dynamic library interposition has been ex- 
tremely successful for us. We have been able to exploit 
the detailed information in many different and useful 
ways for many different libraries. The value of tracing 
the parameters in addition to the functions should not
be underestimated. Initially developing the tools was 
nontrivial but with the tools in place, our development 
teams are able to make much more informed decisions 
based on real workloads and fewer guesses. We’ve 
generated approximately 40 interposing libraries of 
both Sun supplied libraries and third party libraries. 
Around a dozen applications have used SLI as the pri- 
mary analysis tool with significantly improved perfor- 
mance. Playback to test pre-release hardware and soft- 
ware improved release quality, and hardware 
simulation trace files are currently being generated for
projects in progress.

Abstract


To function in mobile computing environments, dis- 
tributed file systems must cope with networks that 
are slow, intermittent, or both. Intermittence vitiates 
the effectiveness of callback-based cache coherence 
schemes in reducing client-server communication, be- 
cause clients must validate files when connections are 
reestablished. In this paper we show how maintain- 
ing cache coherence at a large granularity alleviates 
this problem. We report on the implementation and 
performance of large granularity cache coherence for 
the Coda File System. Our measurements confirm the 
value of this technique. At 9.6 Kbps, this technique 
takes only 4-20% of the time required by two other
strategies to validate the cache for a sample of Coda 
users. Even at this speed, the network is effectively 
eliminated as the bottleneck for cache validation. 


1 Introduction 


Callback-based cache coherence [4, 10] in distributed 
file systems has proven to be invaluable for preserving 
scalability while maintaining a high degree of consis- 
tency. This technique is based on the implicit assump- 
tion that the network is fast and reliable. Unfortunately, 
this assumption is often violated in mobile computing 
environments. Network communication in those envi- 
ronments is often slow and intermittent. 


Instead of requiring a client to check the validity
of a file on each access, a callback-based scheme places
greater responsibility on the server. When a client
caches a file from a server, the server promises to notify 
it if the file changes. This promise is called a callback. 
An invalidation message is called a callback break. If a
client receives a callback break for a file, it discards the
cached copy and re-fetches it when it is next referenced.


When a client with callbacks encounters a net- 
work failure, it must consider its cached files suspect 
because it can no longer depend on the server to no- 
tify it of updates. Upon repair of the failure, the client 
must validate cached files before use. Consequently, 
as failures become more frequent, the effectiveness of 
a callback-based scheme in reducing validation traffic 
decreases. In the worst case, client behavior may de- 
generate to contacting the server on every reference. 
This problem is exacerbated in systems that use antic- 
ipatory caching strategies such as hoarding to prepare 
for failures [1,5]. In these systems, validation traffic is 
proportional not just to the file working set, but to the 
larger resident set. The more diligent the preparation 
for failures, the larger the resident set. The impact of 
this problem increases as network bandwidth becomes 
precious. 


We can address this problem without weaken- 
ing consistency by increasing the granularity at which 
cache coherence is maintained. This makes validation 
more efficient, allowing clients to recover from failures 
more quickly. Taken to an extreme, this idea would re- 
quire maintaining a version stamp and callback on the 
entire file system. If the version stamp remains un- 
changed after a failure, the client can be confident that 
no files have been updated on the server. A callback 
on the entire file system is a very strong statement — 
it means every file cached at the client is valid. How- 
ever a callback break on the file system conveys little 
information — anything in the file system could have 
changed, whether cached at the client or not. A prac- 
tical implementation of this idea requires a choice of 
granularity that balances speed of validation with pre-
cision of invalidation.


In this paper we report on the implementation 
and performance evaluation of large granularity cache 
coherence in the Coda file system [10]. Our results 
show that large granularity cache coherence is well- 
suited for a significant fraction of what Coda users 
typically cache. At 9.6 Kbps, this technique takes only 
4-20% of the time required by two other strategies
to validate the cache. Effectively, this technique elim- 
inates the network as the bottleneck for cache valida- 
tion. At higher bandwidths, the value of this technique 
diminishes, but it is always at least as good as the other 
strategies. 


The paper begins by introducing key aspects of 
Coda. Then we describe the operation of the system 
with multiple granularities. Details concerning the ac- 
tual implementation are presented in Section 4. Finally 
we describe the current status of our system and eval- 
uate its performance. 


2 Coda File System 


Coda is a descendant of AFS-2¹ [4] that has high data
availability as its main goal. Like AFS, it provides 
a single, shared, location-transparent name space, and 
maintains cache coherence using callbacks. Files are 
stored in volumes [11], each forming a partial subtree 
of the name space. Volumes are administrative units, 
typically created for individual users or projects. A 
user-level process called Venus manages a file cache 
on the local disk of each client. Venus makes requests 
of servers through the Vice interface using remote pro- 
cedure calls (RPC). Files are identified by fids, which 
are 96 bits long. The first 32 bits of the fid are the 
volume identifier. 
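
The fid layout described above maps naturally onto a small C structure. The sketch below is an assumption about representation, not Coda source code: the coda_fid type and helper name are hypothetical, and the remaining 64 bits are left opaque because the text does not describe them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A 96-bit file identifier as three 32-bit words. */
typedef struct {
    uint32_t volume;   /* first 32 bits: the volume identifier */
    uint32_t rest[2];  /* remaining 64 bits, left opaque here */
} coda_fid;

/* Two files belong to the same volume when their volume words match,
 * so volume membership can be tested without any lookup. */
bool fid_same_volume(const coda_fid *a, const coda_fid *b)
{
    return a->volume == b->volume;
}
```

Because the volume identifier is embedded in every fid, a client can group its cached files by volume directly from their identifiers.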


Coda uses two strategies to achieve high data 
availability: server replication and disconnected oper- 
ation. Server replication allows volumes to be stored 
at a group of servers called the volume storage group 
(VSG). At any time, the subset of those servers avail- 
able is called the accessible volume storage group 
(AVSG). When making requests, clients contact all 
servers in the AVSG (though data is fetched from only 
one), and all servers maintain callbacks for objects 
cached from the VSG. If an AVSG grows, clients drop 
callbacks for objects stored at that VSG, because the 
newly available server may contain more recent data. 
Further details on server replication may be found in 
[10]. 


¹AFS has evolved since the version from which Coda was de-
rived, which was AFS-2. The currently deployed version is AFS-3. 
Unless qualified, the term AFS applies to both versions. 


Disconnected operation arises when the AVSG 
becomes empty. To prepare for disconnection, users 
may hoard data in the cache by providing a prioritized 
list of files called the hoard database, or HDB. Venus 
combines hoard database entries with LRU information 
as in traditional caching schemes to implement a cache 
management policy that addresses both performance 
and availability concerns. Periodically, Venus walks 
the cache to ensure that the highest priority items in 
the HDB are cached and consistent with the AVSG. 
Hoard walks may also be requested explicitly by the 
user. If an object in the HDB is invalidated, it is re- 
fetched on the next reference or during the next hoard 
walk, whichever comes first. Hoard walks can create 
bursty network traffic. A hoard walk after an AVSG 
grows results in a validation request for every cached 
object from that VSG. More details on hoarding and 
disconnected operation may be found in [5]. 


3 Protocol Description 


At how many granularities should cache coherence be 
maintained? In principle there can be many levels. An 
obvious mapping onto a Unix file name space would 
suggest a hierarchy of granularities. But the desire 
for a simple implementation led us to use just two: 
files² and volumes. Volumes are attractive as units
of coherence because they tend to represent groups of 
files that are logically related and hence possess similar 
update characteristics. 


When a client maintains coherence on files, it 
must validate them before use when the AVSG has 
grown. This approach is based on the assumption that 
the newly available server has rendered some file in 
the cache stale, necessitating a check of each one. As 
failures become more frequent, the price of suspicion 
increases. Increasing the granularity of coherence al- 
lows a client to summarize the contents of its cache 
for the purpose of validation. This approach is more 
optimistic, in that it assumes there are sets of cached 
files that have not changed during the failure. 


To summarize cache state by volume, servers 
maintain version stamps for each volume they store. 
The version stamp for a volume is incremented when- 
ever an object in the volume is updated. A client caches 
the version stamp, establishing a callback for the vol- 
ume. When the AVSG grows, the client validates the 
files in a volume by sending its version stamp to the 
server. If the stamps match, all of the client’s cached 
data from the volume is valid. The server grants a 
callback for the volume to allow the client to read the 


²In this paper, we use the term file to refer to single objects in the
file system, including directories and symbolic links. 
cached files without any additional communication. If
the validation fails the client reverts to file callbacks. 
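
The validation exchange can be illustrated with a small sketch. It assumes integer stamps that start at 1 and a single server; the names `Server`, `validate`, and `recover` are hypothetical and are not part of the Vice interface.

```python
class Server:
    """Toy server holding one version stamp per volume (hypothetical names)."""

    def __init__(self, volumes):
        self.stamps = {vid: 1 for vid in volumes}
        self.volume_callbacks = {vid: set() for vid in volumes}

    def update_object(self, vid):
        # Any update to an object in the volume increments its stamp.
        self.stamps[vid] += 1

    def validate(self, client, vid, stamp):
        # A matching stamp validates the whole volume and grants a callback.
        if self.stamps.get(vid) == stamp:
            self.volume_callbacks[vid].add(client)
            return True
        return False


def recover(server, client, cached):
    """cached maps volume id -> (cached stamp, list of cached fids).

    Returns the volumes validated wholesale and the files that must
    fall back to individual validation.
    """
    valid_volumes, files_to_check = [], []
    for vid, (stamp, fids) in cached.items():
        if server.validate(client, vid, stamp):
            valid_volumes.append(vid)
        else:
            files_to_check.extend(fids)
    return valid_volumes, files_to_check
```

Only the volumes whose stamps no longer match cost per-file work; every other volume is validated with a single comparison.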


We expect maintaining coherence on volumes 
to be beneficial for collections owned by the primary 
user of a client, and for collections that don’t change 
frequently or change en masse (e.g., system binaries) 
[8]. In Section 5, we show that such collections repre- 
sent a large fraction of the files that users cache. File 
callbacks are more appropriate for volumes that are 
shared or owned by users other than the primary user 
of a client. 


Of course, the client must ensure that version 
stamps are consistent with the data they represent, and it 
must handle updates from other clients, which manifest 
themselves as callback breaks. We discuss these issues 
further in the remainder of this section. 


3.1 Obtaining Callbacks 


A client caches a volume version stamp to prepare for 
the next failure. If a client presents an up-to-date stamp 
after a failure, it is granted a callback on the volume. 
The volume callback is a substitute for file callbacks 
on all the files in that volume. The callback is actually 
on the version stamp. It means the client has files 
corresponding to the version of the volume designated 
by the value of the stamp. 


Before obtaining a volume version stamp, we 
require all files in the cache from that volume to be 
valid and have callbacks. This ensures the files at 
the client correspond to the version stamp it receives. 
Since validating the files could be expensive, the client 
should employ a policy that balances this cost with the 
expected value of having a volume version stamp in 
case of a failure. We discuss policy further in Section 
4.3. For volume callbacks to be effective, there should 
be more than one file cached from the volume. 


If the client holds a volume callback and fetches 
a new file, the server establishes a file callback for 
the new file. This is not necessary for correctness, 
but it is useful for performance. Although one could 
imagine not establishing the file callback to conserve 
server memory, granting the file callback in this case 
requires no additional network traffic, and gives the 
client something to fall back on should the volume 
callback be broken. 


3.2 Handling Callback Breaks 


When a file is updated by a remote client, the server 
breaks callbacks to all other clients holding a callback 
for that file or its volume. If a client holds callbacks 
on both the file and the volume, the server breaks the 
callback on the file. The client interprets this as a
callback break on the volume as well, and erases its 
version stamp. Note that if a client holds a volume 
callback, it will receive a callback break even if the 
updated file is not in its cache. This is false sharing, 
and if frequent, may indicate that the granularity of 
cache coherence is too large for that volume. The client 
should not blindly reestablish the callback when it is 
broken, because updates exhibit temporal locality [2, 
9]. Not only would this be a waste of bandwidth, but it 
would also harm scalability. The client’s policy should 
take this into account when determining whether the 
volume callback should be reestablished. 


The presence of both volume and file callbacks 
means clients must decide what kind of callback to 
obtain when one is broken. Suppose a client validates 
a version stamp for a volume, and it receives a volume 
callback. At this point it has no file callbacks. If the 
volume callback is broken, the client must validate its 
cached files from that volume before it can reestablish 
the volume callback. In terms of network usage, this is 
equivalent to recovery from a failure without volume 
callbacks. In effect, the client has delayed validation 
of individual files. 


In the situation above one might imagine obtain- 
ing file callbacks in the background in case the volume 
callback is broken. This eager strategy assumes a re- 
mote update will occur before the next failure. How- 
ever, this defeats the purpose of obtaining a volume 
callback. Instead, we employ a lazy strategy, obtain- 
ing file callbacks only if the volume callback is actually 
broken. If no remote updates occur between failures, 
we have saved the network bandwidth and server mem- 
ory that would have been required to validate and obtain 
file callbacks. 
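
A client-side sketch of the lazy strategy follows. The class and its methods are hypothetical, and the sketch tracks only which files would need individual validation; it performs no communication.

```python
class CachedVolume:
    """Client state for one volume (hypothetical names)."""

    def __init__(self, fids):
        self.cached = set(fids)
        self.file_callbacks = set()
        self.stamp = None              # None means no version stamp is held
        self.volume_callback = False

    def grant_volume_callback(self, stamp):
        self.stamp = stamp
        self.volume_callback = True

    def on_callback_break(self, fid=None):
        # A break on a file is also treated as a break on its volume.
        if fid is not None:
            self.file_callbacks.discard(fid)
        self.stamp = None
        self.volume_callback = False

    def needs_validation(self, fid):
        # Lazy strategy: a file is validated individually only once the
        # volume callback is gone and the file has no callback of its own.
        return not self.volume_callback and fid not in self.file_callbacks

    def validated(self, fid):
        self.file_callbacks.add(fid)
```

While the volume callback holds, no file needs a callback of its own, which is where the savings in bandwidth and server memory come from.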


4 Implementation 


We layered volume callbacks on the existing callback 
mechanism as much as possible. Code changes were 
required in the Vice interface, the server, and Venus. 
We discuss these changes in the following subsections. 


4.1 Vice Interface 


We added two new calls to the Vice interface that 
manipulate version stamps, which were already be- 
ing maintained by each server for replication. The first 
new call is ViceGetVolVS, which takes a volume 
identifier, and returns a version stamp and a flag indi- 
cating whether or not a callback has been established 
for the volume. 
ViceGetVolVS(IN VolumeId Vid,
             OUT RPC2_Integer VS,
             OUT CallBackStatus CBStatus);


The second call, ViceValidateVols, takes
a list of volume identifiers and version stamps and
returns a code for each indicating if it is valid, and
if so, whether a callback has been established for the
volume. The structure RPC2_CountedBS consists of
a length field and a sequence of bytes.


ViceValidateVols(
    IN ViceVolumeIdStruct Vids[],
    IN RPC2_CountedBS VS,
    OUT RPC2_CountedBS VFlagBS);


Besides the two new Vice calls, there are also 
new parameters to existing calls that perform updates 
(mkdir, rename, etc.). 


4.2 Server side 


Server code is required to support the new Vice RPCs, 
and volume callbacks themselves. We added about 
400 lines of code to the server, which consists of ap- 
proximately 14,500 lines of code excluding headers 
and libraries. Most of the changes involved supporting 
the new RPCs (200 lines) and debugging and printing 
statistics (150 lines). The remainder of the changes 
were for gathering statistics. 


We minimized changes to data structures and 
code involving callbacks by designating an unused fid 
((VolumeId).0.0) to represent an entire volume. We
modified the callback break routine to break callbacks 
not only for a file, but also for the volume that contains 
it. 
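
The reserved fid lets one callback table serve both granularities. The sketch below assumes fids are (volume, vnode, unique) tuples and that a single break message clears both of a client's callbacks; `CallbackTable` and its methods are hypothetical, not the server's actual routines.

```python
from collections import defaultdict


def volume_fid(vid):
    # The otherwise unused fid <vid>.0.0 stands for the whole volume.
    return (vid, 0, 0)


class CallbackTable:
    def __init__(self):
        self.holders = defaultdict(set)   # fid -> clients holding a callback

    def add(self, client, fid):
        self.holders[fid].add(client)

    def break_callbacks(self, fid, updater):
        """Break callbacks after `updater` changes `fid`; return notified clients."""
        vfid = volume_fid(fid[0])
        notified = (self.holders[fid] | self.holders[vfid]) - {updater}
        for client in notified:
            # One break message per client, even if it held both callbacks.
            self.holders[fid].discard(client)
            self.holders[vfid].discard(client)
        return notified
```

A client that holds only the volume callback is notified even if the updated file is not in its cache, which is the false sharing noted in Section 3.2.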

Updates change the volume version stamp, 
whether they are made remotely, or by a local client. 
When a client updates a file, it receives a status block 
containing the file’s new version information and at- 
tributes. The status block is shown in Figure 1. Simi- 
larly, the client must be able to observe the effects of its 
updates on the volume version stamp, without receiv- 
ing callback breaks or sending additional messages. 


We considered two approaches for updating the 
client’s version stamp when it performs an update — 
having the client compute the new stamp, or having 
the server compute and return it. The advantage of 
having the client compute the new stamp is no addi- 
tional changes need to be made to the Vice interface. 
Unfortunately, since the server must maintain version 
stamps anyway, this approach duplicates a good deal 
of code, and is more difficult to test and maintain. 


We chose to have the server compute and return 
the new version stamp. We have added three parame- 
ters to Vice calls that involve updates: 


• the old version stamps
• the new version stamp
• the callback status


When a client performs an update, it sends its 
copy of the volume version stamp to the server along 
with the other parameters for the operation. If the 
client’s stamp is current, the server returns the new 
stamp and a callback for the volume. If it is not, the 
server returns a zero stamp, and no callback. If the 
client does not have a stamp, or does not wish to obtain 
a volume callback, it simply sends a zero stamp. This 
is guaranteed never to match at the server. 
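
The exchange on an update can be sketched as below, assuming integer stamps that start at 1 so that zero never matches. `VolumeServer` and `update` are hypothetical names, and the sketch ignores replication and concurrency.

```python
ZERO_STAMP = 0   # by convention never matches a real stamp


class VolumeServer:
    def __init__(self):
        self.stamp = 1                 # real stamps start above ZERO_STAMP
        self.callback_holders = set()

    def update(self, client, old_stamp):
        """Apply an update; return (new stamp, volume callback granted?)."""
        was_current = old_stamp != ZERO_STAMP and old_stamp == self.stamp
        self.stamp += 1
        self.callback_holders.clear()  # other clients' volume callbacks break
        if was_current:
            self.callback_holders.add(client)
            return self.stamp, True
        return ZERO_STAMP, False
```

A client whose stamp was current therefore keeps a valid stamp across its own updates without any extra message.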


This process is complicated by concurrency con- 
trol. Files involved in an update are locked for the du- 
ration of the operation. For performance reasons, the 
server cannot lock a volume for the entire duration of 
an update. Therefore, it is possible for updates to dif- 
ferent objects in a volume to be interleaved. To detect 
this, the server updates the client’s version stamp along 
with its own, and checks for a match at the end of the 
call. 
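
The interleaving check can be illustrated with a single-threaded sketch in which each call is split into explicit steps. The `UpdateCall` structure is hypothetical; a real server would perform these steps under its own synchronization.

```python
ZERO_STAMP = 0


class Volume:
    def __init__(self):
        self.stamp = 1


class UpdateCall:
    """One in-progress update call (hypothetical structure)."""

    def __init__(self, volume, client_stamp):
        self.volume = volume
        self.client_stamp = client_stamp

    def apply(self):
        # The volume is not locked for the whole call, so other calls may
        # run apply() between this call's apply() and finish().
        self.volume.stamp += 1
        if self.client_stamp != ZERO_STAMP:
            self.client_stamp += 1

    def finish(self):
        # A mismatch means the stamp was stale or another update interleaved.
        if self.client_stamp == self.volume.stamp:
            return self.client_stamp, True
        return ZERO_STAMP, False
```

If a second update lands between `apply` and `finish`, the server's stamp has advanced twice while the client's copy has advanced once, so the final comparison fails and no callback is granted.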


There are operations other than file updates that 
change volume version stamps. We made a few addi- 
tional changes to two server libraries to ensure call- 
backs would be broken when these operations oc- 
curred. One of the libraries supports debugging; the 
other is part of the resolution subsystem [7]. 


Our implementation was complicated by a num- 
ber of race conditions, pertaining to server replication, 
that manifested themselves during initial testing. These 
race conditions were present in the original code from 
which Coda is derived, but were triggered when clients 
eagerly acquired volume callbacks. 


For example, the callback processing code is 
structured to prevent a server from adding a callback for 
a fid while breaking a callback for that fid. Callbacks 
are maintained at all servers in the AVSG. The race 
condition occurs when a client receives callback breaks
from a subset of the AVSG and immediately tries to 
reestablish its volume callback. This request is sent to 
all the servers in the AVSG; this may include ones still 
breaking the callback. This used to cause the servers 
to crash. We fixed this by returning the callback status, 
and having the server not establish callbacks in this 
situation. 
typedef RPC2_Struct
{
    RPC2_Unsigned     InterfaceVersion;
    ViceDataType      VnodeType;
    RPC2_Integer      LinkCount;
    RPC2_Unsigned     Length;
    FileVersion       DataVersion;
    ViceVersionVector VV;
    Date              Date;
    UserId            Author;
    UserId            Owner;
    CallBackStatus    CallBack;
    Rights            MyAccess;
    Rights            AnyAccess;
    RPC2_Unsigned     Mode;
    VnodeId           vparent;
    Unique            uparent;
} ViceStatus;


Figure 1: Vice Status Block 


This figure shows the Vice status block, which is returned 
for the objects of most Vice calls. It includes version 
information for the object, whether or not the server has 
extended a callback promise for it, and the access rights of 
the requesting user and the anonymous user 
System:Anyuser. 


4.3 Client side 


Most of the logic for supporting volume callbacks is in 
Venus. In addition to using the new RPCs, Venus must 
cope with replication, and decide when using volume 
callbacks is appropriate. The changes represented an 
addition of 700 lines to about 36,000 lines of code 
excluding headers and libraries. 


The implementation of Venus is layered with 
respect to files and volumes. The changes for volume 
callbacks are concentrated in the volume layer, leav- 
ing the heart of Venus unchanged. We augmented the 
volume data structure to store a volume version stamp, 
the status of a volume callback, and summary statistics 
such as the number of callbacks established, broken, 
and cleared. 


There are a number of background processes 
within Venus that run periodically. The hoard daemon, 
for example, runs a hoard walk every ten minutes. The 
volume daemon checks each volume to effect state 
changes every five seconds. Our code is structured such
that volume version stamps are likely to be obtained or 
validated in the background by one of these daemons. 
This greatly reduces the chance that the cost of these 
tasks is incurred on demand during a user request. 


1994 Summer USENIX - June 6 - 10, 1994 - Boston, MA 


4.3.1 Policy 


As mentioned in Section 3, Venus should have some 
policy to determine when to obtain a volume callback. 
The optimal policy would obtain a volume callback 
only if a failure was going to occur and be repaired 
before the next remote update. Otherwise, either the 
volume callback would be broken, or the next validation
would fail.


One could invent a variety of policies to ap- 
proximate the optimal one. We decided to use a simple 
policy, in which Venus obtains volume callbacks only 
during hoard walks. We chose this policy for several 
reasons. 


1. Volume version stamps are intended to be useful 
in preparing for failures. This is synonymous with 
the purpose of hoarding. 


2. During a hoard walk, cached files are validated 
anyway. The additional overhead of obtaining a 
version stamp for each volume is low. 


3. This strategy satisfies our scalability concerns. If 
a volume callback is broken, the client will not 
request another one until the next hoard walk. 


4. Since hoard walks are periodic, the window of 
vulnerability to failures is bounded. For a client 
to lose the opportunity to validate files by volume, 
a remote update would have to be followed by a 
failure within one hoard walk interval (typically 
ten minutes). In this case, the client is no worse 
off than it was before the use of volume callbacks. 


This policy also copes nicely with voluntary dis- 
connections, when a user deliberately removes a lap- 
top computer from the network. In our environment, 
many users have both desktop and laptop computers. 
While at work, they work from the desktop computers, 
leaving their laptops connected nearby. Some users 
modify files hoarded on their laptops from their desk- 
top. Before disconnecting, they run a hoard walk on 
the laptop to fetch the files they just changed from 
the desktop. While connected, the laptop observes the 
remote updates to volumes that are referenced in its 
hoard database. These volumes are prime candidates 
for volume callbacks. A policy that becomes more 
conservative about obtaining volume callbacks when 
remote updates occur would be unlikely to obtain them 
in this case. In contrast, our policy takes advantage of 
explicit hoard walks as hints of imminent disconnec- 
tion. 
4.3.2 Access Rights


Directories in Coda have access lists associated with 
them that specify the operations that a user or group 
of users may perform on them. Venus caches access 
information to perform access checking locally. It ob- 
tains the information from the Vice status block, which 
is a result of most Vice calls. The access cache for a directory
consists of a fixed number of entries containing
a user identifier and that user’s rights on the directory. 
Entries are considered valid when they are installed 
from the Vice status block. They are considered in- 
valid (or suspect) if the object is invalidated, the user’s 
authentication tokens expire, or if the AVSG grows. 


When files are validated in groups, such as by 
volume, access information is not returned for the in- 
dividual files. To avoid sending messages to the server 
to check access information, Venus must use the access 
cache more aggressively than it did in the past. If an 
object is deemed valid, clearly its access rights have 
not changed. Venus now considers entries in the rights 
cache for a file valid if the file is valid, and the entry 
corresponds to a user who is authenticated. 
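
The revised validity rule for the rights cache amounts to the check sketched below. The `AccessCache` class is hypothetical, and for brevity it ignores the fixed number of entries per directory.

```python
class AccessCache:
    """Per-object rights cache (hypothetical names)."""

    def __init__(self):
        self.rights = {}   # user id -> rights bits from the Vice status block

    def install(self, user, rights):
        self.rights[user] = rights

    def lookup(self, user, object_valid, authenticated_users):
        # Revised rule: an entry is usable whenever the object itself is
        # valid and the entry belongs to an authenticated user.
        if object_valid and user in authenticated_users and user in self.rights:
            return self.rights[user]
        return None   # the caller must ask the server
```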


4.3.3 Effects of Replication 


Coda’s support of replicated volumes affects the 
client’s handling of volume version state in two ways. 
First, Venus communicates with the AVSG as a group, 
sending the same copy of each request to each member 
of the group. This is performed by the underlying RPC 
protocol, which was designed to support remote pro- 
cedure call to a set of machines in parallel. Because of 
this, a validation request must contain the stamps for 
all the servers in the VSG. Each server simply checks 
the one corresponding to it. 


Second, Venus must collate multiple responses 
to its requests. When requesting version stamps, it must
store the stamp for each server that responds. When 
validating version stamps, all servers must agree that 
the stamps are valid before Venus can declare them 
valid. Similarly, all servers must agree that a callback 
has been established before Venus can assume it has a 
callback on the volume. 
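
Both rules can be sketched in a few lines. The function names are hypothetical, and treating a missing reply as a failed validation is an assumption of this sketch rather than something the text specifies.

```python
def server_check(server, request_stamps, current_stamp):
    # Each server looks only at the slot that corresponds to itself.
    return request_stamps.get(server) == current_stamp


def collate_validation(avsg, replies):
    """replies maps server -> (stamp_valid, callback_granted).

    Returns (valid, callback) for the volume as a whole.
    """
    if not avsg or any(server not in replies for server in avsg):
        return False, False
    valid = all(replies[s][0] for s in avsg)
    callback = valid and all(replies[s][1] for s in avsg)
    return valid, callback
```

Requiring unanimity keeps the client conservative: a single dissenting replica is enough to send it back to file-level validation.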


5 Status and Evaluation 


Servers supporting volume callbacks have been in use 
for several months. The corresponding Venus is cur- 
rently in alpha test, and we expect to release it for 
production use shortly. 


The primary reason for using large granularity 
cache coherence is to validate a client’s cache quickly 
after a failure is repaired. In this section, we present 


measurements of cache validation times for five typical 
Coda users under a variety of conditions. 


5.1 Experiment Design 


The time required to validate a client’s cache after a 
failure is the figure of merit for our experiments. We 
call this the recovery time of the cache. Obviously, 
recovery time depends on the contents of the cache. For 
the experiments, we gathered the hoard profiles of five 
Coda users, summarized in Table 1. A hoard profile 
is the input to a program that updates the HDB. These 
profiles are used primarily for laptops. To broaden our 
study, we deliberately chose users whose profiles were 
dissimilar. 


We performed the experiments with a single 
client and server, both DECstation 5000/200s with 32 
MB of memory, running Mach 2.6. The client used a 
50 MB Coda file cache. The machines were connected 
via Ethernet. To emulate slower networks and inject 
failures, we used a failure library linked into Venus and 
the server. The library allows packets to be delayed or 
suppressed according to a filter, which specifies under 
what conditions the mischief is to occur. For example, 
one might request packets to a certain host be dropped 
with some probability, or delayed as if the network 
were a lower speed. Requests to insert and remove 
filters are issued to the failure package via RPC. 


We began each experiment by initializing the 
hoard database with the profiles for a single user. Then 
we ran a hoard walk, and partitioned the client from the 
server. Once the client detected the failure, we healed 
the partition, caused the client to notice the server was 
up, and immediately ran a hoard walk. We measured 
the time it took for Venus to validate its cache entries, 
from when it noticed the server was up to the end of 
the hoard walk. We assume no updates on cached 
volumes were made to the server by any other client 
during the failure. Although this is the best case, we 
believe it is an important common case in intermittent 
environments.


5.2 Parameters Explored 


We studied recovery times over four network speeds 
and three validation strategies for each user. The net- 
work speeds were 10 Mb/sec, representing Ethernet; 
2 Mb/sec, representing packet radio (such as NCR
WaveLan™); 64 Kb/sec, representing ISDN; and 9.6
Kb/sec, representing a typical dialup connection. The
validation strategies were “NoOpt”, “Batched”, and
“VCB”. The NoOpt strategy validates an object by 
fetching its status block from the server and comparing 
it to the cached copy. This corresponds to the Vnode 
Volume Type
X11
TeX
System
Cboard
Other tools
Coda binaries
Coda sources
Kernel sources
User 1 personal
User 2 personal
User 3 personal
User 4 personal
User 5 personal
Other personal
Total files
Total volumes
Cache size (MB)

Number of Files Cached
38 127 133 125 142

Table 1: Contents of Hoard Profiles for Five Coda Users, by Volume 


This table characterizes the contents of the hoard profiles for the five Coda users studied in the experiments described in Section 
5.1. Entries represent the number of files hoarded from each volume by each user. 


The system volume contains system binaries, utilities, and include files. Cboard is a project volume for a calendar program; its 
maintainer is user 5. “Other tools” refers to five volumes containing utilities such as GNU-Emacs and less. The “Coda binaries” 
volume contains Coda-related programs that many users hoard. The “Coda sources” category is of interest primarily to Coda 
developers. It consists of two volumes containing scaffolding for the project tree, libraries, include files, and sources. User 4’s 
personal files are split into a home volume and a volume solely for object files. “Other personal” is a set of five volumes belonging 
to users other than the ones we studied. Two of those volumes contain versions of kermit that most users hoard, and one
contains a popular window manager.


operation GetAttr [6]. The Batched strategy allows 
a group of files to be validated in one RPC. More specif- 
ically, in Coda up to 50 fids may be piggybacked with 
version information on a GetAttr request. The VCB 
strategy validates objects by volume using previously 
cached version stamps. These are also batched; for 
these experiments only 1 RPC is needed to validate the 
volumes. 
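
The difference between the strategies shows up directly in the number of validation RPCs. The sketch below counts them under the assumption that all volume stamps fit in one request, as in these experiments; the function name and the file and volume counts in the usage note are hypothetical.

```python
import math

BATCH_LIMIT = 50   # fids per GetAttr request, per the text


def validation_rpcs(files, volumes, strategy):
    """Number of RPCs needed to validate a cache after a failure."""
    if strategy == "NoOpt":
        return files                          # one status fetch per object
    if strategy == "Batched":
        return math.ceil(files / BATCH_LIMIT)
    if strategy == "VCB":
        return 1 if volumes else 0            # assumes one batched request
    raise ValueError(strategy)
```

For a hypothetical cache of 1200 files in 20 volumes, this gives 1200 RPCs for NoOpt, 24 for Batched, and 1 for VCB when every volume validation succeeds.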


Although the current production version of Coda 
uses the Batched strategy, we measured the NoOpt 
strategy for two reasons. First, it allows our results 
to be compared to file systems that do not batch val- 
idations, such as AFS. Second, even though batch- 
ing takes less time and bandwidth at any speed than 
NoOpt, it has some disadvantages at low bandwidth. 
Batching can result in large request packets, nearly
3KB in Coda. These requests stress the underlying 
RPC protocol, because retransmissions at low band- 
width can starve other requests, and cause Venus to 
declare servers down. Indeed, we experienced such 
failures while conducting our experiments! It may be 
more appropriate to use a smaller batching factor for 
low bandwidth networks. Latency is also significantly 
affected by request size when bandwidth is low. Cur- 
rently a demand (user) request for one file will cause 
a batch validation of up to 50 files, which incurs ad-
ditional latency that could be deferred to background 
processes. 


Batching of volume validations does not have 
as great an impact on the system as batching of file 
Network     Validation    Relative
Speed       Strategy      Times
10 Mb/s     NoOpt         100.0%
            Batched        38.5%
            VCB            35.5%
2 Mb/s      NoOpt         100.0%
            Batched        42.9%
            VCB            34.9%
64 Kb/s     NoOpt         100.0%
            Batched        40.6%
            VCB            18.5%
9.6 Kb/s    NoOpt         100.0%
            Batched        31.5%
            VCB             4.2%

Recovery Time in Seconds

12.8 (1.4)   5.4 (5)   dus (5)   67.8 (1.4)   23.8 (2.8)   4.8 (.5)

User 2
11.0 (2.6)   4.1 (4)   3:7 €3)   102.8 (.9)   31.4 (2.5)   a3 (5)

User 3
226.1 (2.2)   80.9 (15.8)   8.9 (6)

User 4
315° (5)   11.0 (0)   10.0 (1.3)   63.6 (1.6)   24.3 (5)   9.6 (5)   342.4 (4.0)   103.1 (9.7)   11.3 ¢5)

46.0 (1.1)   19.0 (8)   Lid ¢8)   87.5 (2.2)   36.5 (9)   17.8 (9)   453.8 (9.7)   136.3 (8.7)   20.3 (.9)


Table 2: Cache Recovery Time (Seconds) 


This table presents the time in seconds needed by a client to validate cached files when it discovers a server is up. The cache 
contents are determined by the hoard profiles for each of the five users. The rightmost column is the average validation time
for each strategy relative to NoOpt. It is given as a percentage, and is calculated as
(100 × t_other) / t_NoOpt. These results are conservative in a number of respects, as explained in Section 5.5.


The experiments were conducted with DECstation 5000/200s as the client and server, and volumes stored at one server. 
Measurements were taken over an Ethernet; for the three slower speeds, an emulator was used to delay packets. Each entry is the 
mean and standard deviation (in parentheses) of the most consistent eight trials from a set of ten. 


validations because clients have information on many 
fewer volumes than files, and volume identifiers and 
version stamps are much smaller than their counterparts 
for files. 


5.3 Results


Our results confirm that VCB compensates success- 
fully for the reduction in bandwidth. Table 2 presents 
our observations. For all users and networks, recovery 
times are smallest using VCB, followed by the Batched 
and NoOpt strategies. There is variation across users 
proportional to the number of files cached. The im- 
provement increases as bandwidth decreases. At 9.6 
Kb/sec, where VCB is likely to be most important, recovery
time takes only 4-7% of the time required by
NoOpt, and 11-20% of the time required by batching. 
At higher bandwidths, the value of VCB diminishes, 
but it is always at least as good as the other two strate- 
gies. A glance at Table 2 reveals that the results for 
VCB at 9.6 Kb/sec and 10Mb/sec are not significantly 


different. 


An unexpected result was that the recovery time 
using VCB on a slow network was not constant over all
users. We expected the bottleneck in this case would 
be the network. Since only one RPC was required to 
validate the volumes, we thought the recovery times 
would be similar. We observed recovery times pro- 
portional to the number of files cached, indicating the 
bottleneck is Venus. Most of its time is spent on two 
tasks: marking cached objects suspect when the server 
appears up, and performing the hoard walk, which in- 
volves iterating through all of the objects in the cache 
to ensure they are valid. 


The number of callbacks at the server can be de- 
rived from Table 1, from the number of objects and vol- 
umes each user hoards. In these experiments, clients 
using the Batched or NoOpt strategies obtain callbacks 
for each file validated. Clients using VCB obtain call- 
backs only for the volumes they validated. The number 
of callbacks obtained by clients using VCB is less than
3% of the number obtained by the other two strategies. 


The results presented are for the case in which all 
validations succeed. Over longer periods, or for more 
active volumes, some validations will fail because of 
updates at the server. As long as some validations 
succeed, VCB will still perform better than the other 
strategies. The only case for which VCB is worse is 
if every volume validation fails, and then it is worse 
by 1 RPC. Considering what users hoard, this case is 
unlikely. 


The volumes most likely to change are the per- 
sonal or project volumes of other users, as shown in 
Figure 2. All of the users we studied hoard files from 
other user volumes; however, in all but one case they 
represent less than 1% of the total files. Therefore val- 
idating these files individually if necessary does not 
have a large impact on recovery time. Further, some 
user volumes were inactive during the period shown in 
Figure 2. 


The next most frequently changed set of vol- 
umes are the Coda and kernel source volumes, which 
are shared by up to six project members. These change 
relatively slowly; Figure 2 indicates that the most ac- 
tive of these volumes, the Coda source area, was com- 
pletely unchanged for half of the days in the period we 
studied. Since update traffic is bursty, the results from 
Figure 2 are conservative, especially for intermittent 
environments. Thus we are confident that the benefits 
listed in Table 2 are realistic. 


5.4 Overhead 


Of course, fast validation isn’t free. There are several 
sources of network overhead caused by volume call- 
backs — obtaining callbacks, breaking callbacks, and 
validating volumes. Obtaining a callback on a volume 
requires validation of every cached file in the volume. 
Since this is already done by hoard walks, and the num- 
ber of volumes is small compared to the number of files 
cached by clients, the additional overhead to obtain the 
volume callback is low. 


In the worst case, all the volumes from which a 
client has cached files are being updated actively. The 
client then loses every volume callback it obtains, and 
its volume validations fail. If the sharing is false, the 
effort expended to get volume callbacks is wasted. For- 
tunately, callback requests and breaks are small mes- 
sages, well under 100 bytes. Since these occur only 
once in every hoard walk period, the network over- 
head is still low. The failed volume validation costs 
one extra RPC. For volumes from which many files are 
cached, the cost of validating the files renders that RPC 
Figure 2: Daily Update Frequency of Volumes


This figure shows how often volumes used in our 
experiments are updated on a daily basis. The data was 
gathered from daily backup logs from January through 
March 1994. Each bar indicates the percentage of the days 
in the period during which at least one object in the volume 
was updated. We show volumes in the “Other users”, 
“Other tools”, and “Coda sources” categories separately, as 
well as both of User 4’s personal volumes. 


insignificant. If the sharing is real, the overhead due to 
volume callbacks is likely to be insignificant compared 
to the cost of re-fetching the shared data. Overall, the 
benefits of volume callbacks far outweigh the costs. 


5.5 Accuracy of Results 


The results in Table 2 understate the benefits of VCB in 
a number of respects. First, our failure library underes- 
timates the delay for a given network speed. Emulation 
is performed by a package which intercepts outgoing 
packets and delays them based on the size of the packet, 
the network speed requested in the filter, and the de- 
lays for any packets queued ahead of the one to be sent. 
The delay is a simpleminded calculation, and does not 
take into account overheads such as UDP and IP header 
sizes, or IP fragmentation. A comparison of emulated 
and real times at 9.6Kbps is shown in Table 3. 
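
The delay calculation described above can be sketched as follows. This is an illustration of the stated inputs (packet size, requested speed, and packets queued ahead), not the failure library's code; `LinkEmulator` and `send` are hypothetical names. Like the calculation in the text, it ignores UDP and IP header sizes and fragmentation.

```python
class LinkEmulator:
    """Delay model: serialization time plus time queued behind earlier packets."""

    def __init__(self, bits_per_second):
        self.bits_per_second = bits_per_second
        self.link_free_at = 0.0   # time at which the emulated link is idle

    def send(self, now, size_bytes):
        """Return the time at which the packet is released."""
        start = max(now, self.link_free_at)       # wait for queued packets
        transmit = size_bytes * 8 / self.bits_per_second
        self.link_free_at = start + transmit
        return self.link_free_at
```

At 9600 bits per second, a 1200-byte packet occupies the emulated link for one second, and a packet submitted while it is in flight is released only after that second has elapsed.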


Second, we used volumes with only one replica, 
when most volumes in Coda are triply replicated. Since
many networks do not support multicast, an RPC re- 
quest to an AVSG with more than one member is cur- 
rently sent as separate messages to each member. If the 
network is the bottleneck, the time required to validate 
Packet    Time (seconds)
Size      Emulated    Real

Table 3: Emulated vs. Real RPC at 9.6 Kbps 


This table compares the round trip time for an RPC request 
and response of the same size, using the network emulator 
set to 9.6 Kbps over an Ethernet, and using a dialup SLIP 
link nominally rated at 9.6 Kbps. The experiments were 
conducted using an i386-based laptop as the client and a 
DECstation 5000/200 as the server. RPC packet headers are 
60 bytes long; the first line gives the times for a null RPC. 
We show the mean and standard deviation for the most 
consistent eight trials from a set of ten. The large standard 
deviations for 4060 bytes (emulated) and 1060 bytes (real) 
were due to retransmissions during one or more runs. 


cached files for each of the strategies in Table 2 will be 
proportionately larger. 


Last, caches typically contain more than what 
is hoarded. This occurs for several reasons — name 
space exploration, objects left over from other tasks, 
and execution of a task to find files not included by 
hoard profiles. 


Each of these effects causes our results to underestimate 
the savings due to VCB, especially over low-bandwidth networks. 


6 Conclusion 


This work was motivated by the demands of mobile 
computing. Large granularity cache coherence is valu- 
able in that context because it allows a high level of 
consistency to be preserved even when communication 
is intermittent or expensive. But we anticipate that this 
mechanism will have broader applicability. For ex- 
ample, we expect it to be valuable in systems such as 
AFS, where recent measurements indicate over 50% 
of requests to servers are for fetching status [12]. We 
conjecture that a significant fraction of these are vali- 
dation requests for files that once had callbacks. These 
callbacks may have been lost due to failures or expiry, 
since AFS-3 callbacks are effectively leases [3]. 


Another argument for maintaining cache coher- 
ence at a large granularity has been put forth indepen- 
dently by Wang and Anderson [13]. They proposed 


maintaining cache coherence on clusters of files, such 
as subtrees. Their primary motivation is to reduce 
server state rather than communication. 


Regardless of specific motivation, we are con- 
vinced that large granularity cache coherence is a prac- 
tical and important technique for distributed comput- 
ing. Our experience and measurements confirm that it 
is valuable in preserving the quality of file access in in- 
termittent networking environments. Large granularity 
cache coherence costs little, and offers the potential for 
big savings. 
Abstract 


One way to provide mobile computers with access to 
the resources of a network, even in the absence of 
communication, is to predict which information will 
be used during disconnection and cache the appropriate 
data while still connected. To determine the feasibility 
of this approach, traces of file-access activity for three 
diverse application domains were collected for periods 
of over two months. Analysis of these traces using 
traditional and new measures reveals that user working 
sets tend to be small compared to modern disk sizes, 
that users tend to reference the same files for several 
days or even weeks at a time, and that different users 
do not tend to write to the same file except in highly 
constrained circumstances. These factors encourage 
the conclusion that an automated caching system can 
be built for a wide variety of environments. 


1 Motivation 


The value of mobile computers is that they allow 
users to work while disconnected from their normal 
resources. However, mobile computers typically have 
a great deal less disk storage than is available via re- 
mote mounting on connected networks. This forces 
mobile computer users to face a challenging problem 
of making sure their limited disks always store the in- 
formation they will need while disconnected from other 
machines. Requiring users to deal explicitly with this 
issue puts a heavy burden on them, and the realities of 
modern software methods make it nearly impossible 
for users to identify all the files they actually need.1 A 
fully automated caching mechanism that predictively 
stored all files a user needs on his mobile machine 

*This work was partially supported by the Advanced Research 
Projects Agency under contract N00174-91-C-0107. 

1 For example, starting the X Window System requires access to 
10-30 files or more. The identities of many of these are surprising 
even to expert systems programmers [6]. 

would be very valuable. Such a mechanism is only 
practical, however, if information that can be gathered 
automatically fully captures the typical user’s working 
set of files. 

A prototype system of this sort was developed under 
CMU’s Coda system [6, 14] and proved successful, but 
was inconvenient for the user and was tested only in 
one application environment. 

We undertook this research to investigate the practi- 
cality of automatic file caching for mobility in a wider 
set of application domains, and to discover new and 
less-burdensome ways of identifying files to be cached. 
Our approach was to collect traces of file-access activ- 
ity in several environments over a long period of time, 
and analyze them for feasibility and predictability of 
caching. 

We chose to collect our own traces, rather than us- 
ing existing traces, for three reasons. First, few ex- 
isting traces are long enough. Because most existing 
traces collect read/write activity, a few weeks of data 
is sufficient to tax resource limits. We were interested 
in observing longer-term periodic behaviors such as 
end-of-the-month billing work in an accounting depart- 
ment, which therefore required a several-month trace 
to establish a pattern. 

Second, existing traces have tended to be limited to 
an engineering application domain, usually program- 
ming. We wanted to investigate the behavior of non- 
programmers as well, in the twin beliefs that this type 
of user will eventually be the largest population of 
portable users, and that these users may behave quite 
differently from programmers. 

Third, most previous studies have been 
limited to analysis of working-set sizes and file-system 
performance data [1, 2, 6, 11, 14]. The latter is not 
relevant to this research, and the former, while very 
important, is not in itself sufficient to characterize the 
user behaviors critical to successful mobile caching. 

Successful automated caching requires two characteristics 
in user behavior: 


• The working set of files, as observed over a period 
of days or weeks, must be small enough to fit on 
a portable’s disk. 


• It must be possible to predict the working set in 
advance, using hints such as the current working 
set, historical file access patterns [15], or known 
patterns in user behavior. 


Analysis of the data we have collected shows that 
these characteristics are present in a number of different 
application domains. 


2 Methodology 


We collected our traces at Locus Computing Corpo- 
ration, a software development and consulting firm, 
during the summer of 1993. One of Locus’ prod- 
ucts, PC/Interface (PCI) [8], is a DOS-to-UNIX file 
system implemented as a pseudo-disk driver on a DOS 
machine which communicates via Ethernet to a file 
server on the UNIX system, making the UNIX file sys- 
tem available to the DOS users as native PC files. In 
the environments monitored, the local DOS filesystem 
was used to store some applications software, but all 
shared corporate data was accessed via PCI. The UNIX 
server for PCI was modified to log opens, closes, and 
deletes of files. By avoiding read/write logging, we 
minimized the performance impact and kept the log 
files small. Log entries contain an operation type and 
subtype (e.g., open for read), the UNIX timestamp in 
seconds, the UNIX UID of the invoker, the process ID, 
the absolute pathname of the file, and the size of the 
file. 

Three different user environments were monitored. 
In the first, referred to as “personal productivity,” the 
server was a machine that acted as the network file- 
system for 47 users running business-oriented applica- 
tions such as e-mail, project and calendar scheduling, 
and word processing. These users did not tend to store 
important files on their own machines, so they gener- 
ated high activity at the server. This server was traced 
for 1563 hours (65.1 days, or 9.3 weeks),2 recording 
4,637,924 accesses. 

In the second environment, referred to as “program- 
ming,” the server was a cluster of 10 machines running 
IBM’s Transparent Computing Facility, an adaption 
of the Locus distributed operating system [12], which 
provides a single-system image to users of multiple 


2 50 days into this trace, there was a data gap of approximately 48 
hours due to an administrative error. It does not appear that this gap 
affects the validity of the analysis. 


machines. Each machine ran a separate PCI server, 
and logs from these servers were later combined for 
analysis. Most of the users of this server were pro- 
grammers working on DOS-based software. Because 
they performed much of their work locally, accessing 
the shared server mostly to retrieve or update shared 
source files, they generated relatively little server activ- 
ity. The traces on this server essentially reflect commits 
to a shared database, while omitting most localized file 
activity. This server was accessed by 64 users and 
was traced for 1693 hours (70.5 days, or 10.1 weeks), 
recording 93,719 accesses. 

In the third environment, referred to as “commer- 
cial,” the server was a single machine used by the ac- 
counting department to run a commercial accounting 
application. The master corporate accounting database 
was kept on the UNIX server, but all access to this 
(shared) database was via DOS workstations running 
the commercial package. This server was accessed by 
7 users and was traced for 1257 hours (52.4 days, or 
7.5 weeks), recording 371,830 accesses. 

The nature of the traced environment (local files 
stored on PC’s, with shared files stored remotely) par- 
allels the expected behavior of mobile users, who will 
probably store heavily-used applications locally3 but 
make extensive use of shared resources when they are 
network-connected. However, based on preliminary 
analysis of these traces, we also generated two mod- 
ified traces that omitted certain characteristics we felt 
might be absent on portable platforms due to different 
software and user behaviors. For the commercial en- 
vironment, we reduced all file sizes to a maximum of 
1 MB, on the theory that very large databases would 
be represented by smaller slices in a portable environ- 
ment. This change primarily affected the statistics on 
working-set sizes and the amount of data involved in 
write conflicts and attention shifts, which are measures 
of file sharing and working-set variability that we will 
define in Section 3. For the productivity environment, 
we eliminated all references to fax spooling and mail 
files, because such files are handled in a queued (as 
opposed to shared) manner in disconnected environ- 
ments. This change affected all of the statistics we 
analyzed. These two data sets are referred to as the 
“reduced commercial” and “reduced productivity” en- 
vironments in the tables and graphs. 

Once the traces were collected, we canonicalized 
them using a simple awk script that converts relative 
pathnames to absolute form, correlates each close with 
the corresponding open and produces an output line 
whose format is independent of the operation type to 
make subsequent processing easier. These canonicalized 
files were then compressed and used as the basis 
for our analysis. The largest of these files (from 
the productivity server) is nearly 18 megabytes in its 
compressed form, and about 10 times that large when 
expanded. 

3 We hope that even these will eventually fall under the purview 
of an automated caching system. 

Originally, we used a collection of shell and awk 
scripts for all analysis. As the collected data grew, 
many of these scripts became computationally imprac- 
tical and were replaced by tailored programs. The cur- 
rent design performs the analysis in two phases. First, 
a single-pass program reads the data and extracts sum- 
mary information of interest. For example, for each 24- 
hour day in the collected data, the extraction program 
writes a single line for each user giving the total size 
of that user’s working set, measured in both megabytes 
and files. A second pass then analyzes these summary 
files with general-purpose statistical tools, generating 
the final tables and graphs presented in this paper. 
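
As an illustration of the first phase, the sketch below summarizes each user's working set for each 24-hour day. The record layout (timestamp, UID, pathname, size), the function name, and the use of 2^20 bytes per megabyte are assumptions made for this example; the actual extraction program is not shown in this paper.

```python
from collections import defaultdict

SECONDS_PER_DAY = 24 * 60 * 60

def daily_working_sets(records, trace_start):
    """Summarize each user's working set for each 24-hour day.

    records: iterable of (timestamp, uid, path, size_bytes) tuples
             (a hypothetical layout based on the log fields listed earlier)
    Returns {(day_index, uid): (file_count, total_megabytes)}.
    """
    files = defaultdict(dict)  # (day, uid) -> {path: size}
    for timestamp, uid, path, size in records:
        day = (timestamp - trace_start) // SECONDS_PER_DAY
        files[(day, uid)][path] = size  # count each distinct file once
    return {
        key: (len(sizes), sum(sizes.values()) / 2**20)
        for key, sizes in files.items()
    }
```

Each entry of the result corresponds to one summary line per user per day, which a second pass could then feed to statistical tools.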


3 Statistics 


We generated the same statistics for each parameter 
in each environment: mean, standard deviation, and 
maximum. Besides the traditional measure of working- 
set size, we looked at two measures that have special 
application to mobility: write conflicts and attention 
shifts. 

We define a write conflict event to occur when two 
users write to the same file within a relatively short 
time span. In a mobile environment, a conflicted file 
might be replicated on two or more computers, and 
the system would be required to automatically resolve 
these conflicts after the fact in a manner similar to the 
Ficus distributed file system [3, 4, 7, 13], to force the 
user to resolve them by hand [6], or to limit writing to 
only one user. We examined conflicting writes within 
a 24-hour period (corresponding to taking a machine 
home overnight) and a 7-day period (corresponding to 
traveling with a machine). 
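
The definition of a write conflict can be made concrete with a short sketch. The input layout and function name are hypothetical, and the sketch uses a sliding window between pairs of writes, which is one possible reading of "within a 24-hour period"; it is not necessarily how our analysis programs bucket time.

```python
from collections import defaultdict

def write_conflicts(writes, window_seconds):
    """Find files written by two or more users within a time window.

    writes: iterable of (timestamp, uid, path) tuples for write-mode opens
            (hypothetical layout)
    Returns {path: set of uids} for files with at least one conflict.
    """
    by_path = defaultdict(list)
    for timestamp, uid, path in writes:
        by_path[path].append((timestamp, uid))

    conflicts = {}
    for path, events in by_path.items():
        events.sort()
        users = set()
        for i, (t1, u1) in enumerate(events):
            for t2, u2 in events[i + 1:]:
                if t2 - t1 > window_seconds:
                    break  # later events are even further away in time
                if u1 != u2:
                    users.update((u1, u2))
        if users:
            conflicts[path] = users
    return conflicts
```

Calling it with a window of 86,400 seconds corresponds to the overnight case, and 604,800 seconds to the 7-day travel case.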

An attention shift occurs when a single user radically 
changes his or her working set. We identified attention 
shifts by looking at the working sets in successive active 
n-hour time periods (which did not necessarily represent 
adjacent days or weeks). Within each time period, 
we counted the total numbers of files accessed, k1 
and k2, and then calculated k = min(k1, k2). Within 
the second period, we also counted the total number 
m of files that had not been referenced during the first 
period, but that had existed prior to either period.4 An 
attention shift was defined to occur if m > pk, where 
0 < p < 1. Attention shifts can be characterized by 


4 We eliminated files that were created during the second period 
because they are not problematical for a caching system that must 
predict which existing files need to be stored. 


the parameters p, expressed as a percentage, and n, 
the number of hours in the period. We use the no- 
tation p%/n to describe an attention shift parameter 
pair. Based on a sensitivity analysis (see Figures 6-8), 
we chose p = 20%. We chose n = 24 and n = 168 
(1 week) because these represent typical disconnection 
periods for many portable users. 
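
The attention-shift test can be written directly from the definition. The sketch below assumes the file sets for two successive active periods are already available as Python sets; the function and parameter names are illustrative only.

```python
def is_attention_shift(first_period, second_period, preexisting, p=0.20):
    """Apply the m > p*k test to two successive active periods.

    first_period, second_period: sets of files accessed in each period
    preexisting: set of files that existed before either period began
    p: threshold fraction, 0 < p < 1 (20% is the value chosen above)
    """
    k = min(len(first_period), len(second_period))
    # Files new to the second period, excluding files created during it.
    new_files = (second_period - first_period) & preexisting
    m = len(new_files)
    return m > p * k
```

With five files in each period and p = 20%, two previously existing but unreferenced files in the second period are enough to register a shift, while one is not.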

A final characteristic of an attention shift is the age 
of the shift, which represents the amount of time which 
has elapsed since the user last referenced one of the 
“new” files. We estimated the age by locating the most 
recently-referenced “new” file (a file included in count 
m), and subtracting its reference time from the start 
time of the second period. This is a conservative mea- 
sure, since it assumes that the most-recently-referenced 
file is representative of the entire group m of “new” 
files. 

However, since many of the newly-referenced files 
did not appear previously in the trace, it was not always 
possible to find a file to use in calculating the age of the 
shift. In this case, we conservatively assumed that the 
“new” files had been referenced exactly one second 
before the beginning of the entire trace. Because of 
these two assumptions, the attention-shift ages reported 
in this paper are only a lower bound on the true ages that 
would be encountered by a predictive caching system. 
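
The age estimate, including the fallback for files with no earlier reference in the trace, might be sketched as follows. The argument names and the dictionary of last reference times are assumptions for illustration.

```python
def shift_age(new_files, last_reference, second_period_start, trace_start):
    """Estimate the age (in seconds) of an attention shift.

    new_files: the files counted in m (must be non-empty)
    last_reference: {path: time of last reference before the second period}
    Files with no earlier reference are treated as if referenced one
    second before the trace began, which yields a lower bound on the age.
    """
    times = [last_reference.get(path, trace_start - 1) for path in new_files]
    # The most recently referenced "new" file determines the age.
    return second_period_start - max(times)
```

Because the most recent reference is used, a single recently touched file makes the whole shift look young, which is why the reported ages are conservative.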

The bounded locality intervals discussed in [9] are 
similar to attention shifts, but are parameterized on 
working-set sizes rather than on the expected length of 
a disconnection. 

The statistics we report are: 


Working-set statistics. For each day and week, we 
calculated the working set size in files, MB, and 
number of accesses. Means and standard devi- 
ations were calculated by averaging data across 
time for each UID, and then calculating the mean 
and standard deviation across the per-UID means. 


Attention-shift statistics. For each 1-day and 7-day 
attention shift, we examined the total size of the 
working set needed to hold both the old and the 
new data (in files and MB). We also calculated the 
per-user attention shift rate per day and per week. 
Finally, we calculated the age of each shift. 


Conflict statistics. For each conflict, we examined the 
number of users involved and the size of the file 
involved. We also calculated the per-user conflict 
rate per day and per week. 
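
The two-level averaging used for the working-set statistics (first across time for each UID, then across the per-UID means) can be sketched as below. The function name and input layout are hypothetical, and the choice of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

def across_user_stats(per_user_samples):
    """Average each user's samples over time, then summarize across users.

    per_user_samples: {uid: [working-set size for each day or week]}
    Returns (mean, sample standard deviation) of the per-user means.
    Requires at least two users.
    """
    user_means = [mean(samples) for samples in per_user_samples.values()]
    return mean(user_means), stdev(user_means)
```

Averaging per user first keeps one very active user from dominating the reported mean.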


Success in mobile computing depends on small val- 
ues for all of these statistics. Clearly, the working set 
must be small enough to fit comfortably on the typical 
portable’s disk. The attention-shift rate should remain 
low, both so that the longer-period working set remains 
small and so that it is easier to predict the future work- 
ing set based on recent behavior. The conflict rate must 
remain low to allow convenient file updates. 


4 Analysis 


The results of our analysis are very encouraging for 
our intended application, automated caching of files 
for mobile computers. As hoped, working sets are 
small and attention-shift rates are low. Conflict rates 
are generally low, and it is clear how one could han- 
dle conflicts in the environments that had high con- 
flict rates. However, attention-shift ages tend to be 
high, indicating that a predictive caching system will 
need to exercise significant intelligence to ensure that 
a portable computer is prepared for attention shifts. 

Each table of statistics given below lists the mean 
for the statistic, followed by the standard deviation 
(in parentheses) and the maximum. For example, in 
Table 1, the mean daily working set for the productivity 
environment was 1.0 MB, with a standard deviation of 
2.0 MB and a maximum of 134.5 MB. 

With the exception of Figures 6-8, all figures show 
the variation in a given measure over the duration of 
the trace. For example, Figure 1 shows the daily and 
weekly working sets for the productivity environment, 
for each day and each week captured during the trace.5 


4.1 Working Sets 


Table 1 summarizes the working-set sizes we observed. 
Figures 1-4 show the variation in mean and maximal 
working set sizes with time. 

Mean working-set sizes tended to be small in all three 
environments, with the largest being about 18 MB per 
day and 24 MB per week, in the commercial environ- 
ment. Maximal working sets were very large (148 MB 
per week) only in the personal-productivity environ- 
ment, apparently due to a single grep-style operation 
that occurred in week 9. This “grep phenomenon” 
is clearly visible in Figure 1. Eliminating this sin- 
gle maximum produced a secondary maximum of only 
76 MB. Maximal working sets in the other environ- 
ments ranged only to 66 MB. 

These working-set figures indicate that it will be easy 
to store enough files on a portable disk to satisfy the 


5 In these and all other graphs, the lines connecting data points are 
present only to make it easier to see associated points, and are not 
meaningful in themselves. In particular, although the daily maxima 
in the right-hand sides of Figures 4 and 5 appear to exceed the weekly 
maxima, careful examination shows that only the connecting lines 
cross, and the actual data points for weekly maxima are always larger 
than the daily values. 


average user,6 although some software or user behavior 
may have to change. (For example, instead of relying 
on a large grep, a user might use an inverted index 
to locate the files containing references to a particular 
string [10].) 


4.2 Attention Shifts 


Tables 2 and 3 summarize the attention shifts observed. 
Figures 6-8 show the sensitivity of attention-shift rates 
to the parameter p. Except in the commercial environ- 
ment, the number of attention shifts steadily decreases 
with increasing p, but the exact shape of the curve is 
quite inconsistent. In the absence of a clear-cut change 
in curvature (a knee or cliff), to guide us in the selec- 
tion of p, we chose p = 20%, which is near enough to 
the peak of the curves that we will not tend to underes- 
timate the number of attention shifts, yet not so small 
that we will detect a shift every time a user accesses 
one or two new files. 

Figures 9-11 show the variations in attention-shift 
rates with time, for p = 20%. The amount of data in- 
volved in attention shifts was generally small (33 MB 
or less), though the maxima were large (up to 152 MB; 
this follows from the size of the maximal working set 
and the definition of an attention shift). In all three 
environments, the number of attention shifts was sur- 
prisingly large and consistent, averaging up to 0.6 per 
user per week. This has serious implications for a predictive 
caching scheme, because it shows that simple 
least-recently-used (LRU) caching is not sufficient. 

However, because of the small size of the working 
sets involved in the average attention shift, a well- 
designed predictive cache can afford to store both the 
old and the new set, so that attention shifts need not 
affect the usability of a mobile computer. 

Of course, if there is space to store both the old 
and new working set, the question arises whether a 
simple LRU scheme would be sufficient to ensure that 
both working sets are available. The attention-shift age 
figures shown in Tables 2 and 3 belie this notion. For 
both the programming and the reduced productivity en- 
vironments, the mean age of an attention shift is over 4 
weeks and the maximum is near the length of the trace, 
indicating that an LRU cache would very likely have 
been flushed by transient phenomena before the older 
files were re-referenced. This hypothesis is strength- 
ened by the observation that the conservative method 
of estimating the ages of previously-unreferenced files, 


6 We expect working-set sizes to change dramatically over the 
next few years as users move towards multimedia applications, but 
we also expect that disk sizes will increase sufficiently for portable 
computers to keep pace. In some sense, this phenomenon is self- 
regulating, since users will not tend to use images and sounds exten- 
sively if this would tax their portable storage capacity. 
                     Daily WS Size         Daily WS Size         Weekly WS Size        Weekly WS Size
                     (MB)                  (Files)               (MB)                  (Files)
Environment           Mean σ         Max    Mean σ         Max    Mean σ         Max    Mean σ         Max
Productivity           1.0 (2.0)   134.5      39 (80)     3293     2.7 (4.7)   148.4     110 (215)    3284
Reduced Productivity   0.7 (1.8)    41.1       7 (10)      547     1.4 (2.8)    43.6      19 (31)      548
Programming            0.3 (0.4)    18.0      10 (27)     2153     0.6 (1.1)    18.3      22 (55)     2170
Commercial            18.2 (13.1)   65.0     294 (442)    1643    26.8 (16.6)   65.7     374 (553)    1638
Reduced Commercial    10.9 (6.0)    33.6     294 (442)    1643    16.8 (8.7)    33.8     374 (553)    1638
Figure 1: Working-Set Sizes for Productivity Environment 
Figure 2: Working-Set Sizes for Reduced Productivity Environment 
Figure 3: Working-Set Sizes for Programming Environment 
Figure 5: Working-Set Sizes for Reduced Commercial Environment 
                     Number Per            MB                    Files                 Age
                     User Per Day          Involved              Involved              (Days)
Environment           Mean σ         Max    Mean σ         Max    Mean σ         Max    Mean σ         Max
Productivity           0.4 (0.3)      --     1.6 (6.5)   135.7      64 (164)    3296    10.0 (15.7)    64.
Reduced Productivity   0.2 (0.2)     0.5     0.8 (3.2)    41.1      13 (33)      548    26.2 (19.7)   64.7
Programming            0.3 (0.2)     0.5     0.6 (1.6)    20.9      16 (109)    2161    28.0 (21.3)   70.2
Commercial             0.3 (0.3)     0.9    21.8 (13.8)   65.7      -- --         --      -- --         --
Reduced Commercial     0.3 (0.3)     0.9    14.6 (8.1)    33.8      -- --         --      -- --         --


Table 2: 20%/24-Hour Attention Shifts (All Users) 





                     Number Per            MB
                     User Per Week         Involved
Environment           Mean σ         Max    Mean σ         Max
Productivity           0.6 (0.3)      --     4.7 (12.4)  151.8
Reduced Productivity   0.3 (0.2)     0.4     2.0 (5.5)    44.3
Programming            0.4 (0.2)     0.6     1.7 (3.4)    22.6
Commercial             0.5 (0.4)     1.0    33.3 (17.4)   66.8
Reduced Commercial     0.5 (0.4)     1.0    21.1 (9.0)    33.8
Figure 6: Attention-Shift Sensitivity for Productivity Environment 
Figure 8: Attention-Shift Sensitivity for Commercial Environment 
Figure 9: 20% Attention-Shift Rates for Productivity Environment 
Figure 10: 20% Attention-Shift Rates for Programming Environment 
Figure 11: 20% Attention-Shift Rates for Commercial Environment 
explained in Section 3, would produce a mean age of approximately 
half the length of the trace (about 5 weeks) 
if there were absolutely no historical data in the trace. 
In actuality, the new working set may not have been 
accessed for many months and thus may have been 
flushed from even a very lengthy LRU cache. Other 
methods will be needed to ensure that a mobile machine 
will be prepared for an attention shift. The above data 
merely assures us that there will be room to store both 
today’s and tomorrow’s working sets once they have 
been identified. 


4.3 Conflicts 


Tables 4 and 5 show statistics about conflicts and their 
rate of occurrence, respectively. Figures 12-14 show 
the variations in conflict rates with time. Conflicts 
were very rare in the “programming” environment, av- 
eraging 0.01 conflict per user per day, and only 0.10 
per week. In nearly every case only two users were 
involved in a given conflict, although occasionally a 
third would write to the same file within 24 hours. 


As expected, the 7 users of the “commercial” en- 
vironment, with its shared accounting database, pro- 
duced a high conflict rate of 11 per user per week, 
with up to 6 users writing to the same file in a single 
day. In a mobile environment, an automated resolver 
similar to those discussed in [13] would be required 
to handle these numerous conflicts. Since accounting 
applications typically involve appending records to a 
transaction database, we expect that such a resolver 
would be easy to write. 


The surprise was the “personal productivity” envi- 
ronment, which produced conflict rates up to 1.2 per 
user per day, with up to 22 users writing to the same file 
in a single 24-hour period. We examined these con- 
flicts in more detail to discover the cause, and found that 
nearly all of them involved mailboxes or fax-spooling 
files. 


Since both mailbox and spooling files operate in a 
modified append-only mode (all but one user appends 
to the end of the file, and a simple locking mechanism 
prevents update while other file contents are modified), 
this does not present a problem for mobility. In fact, 
the retry-on-failure queuing algorithm of mailers would 
handle mailbox conflicts with no software changes. In 
view of these observations, we generated the “reduced 
productivity” trace, which omitted these files from the 
statistics. With this change, the conflict rate dropped 
to only 0.04 per user per week, a number so small that 
it could conceivably be handled even without the help 
of automatic resolvers. 


5 Future Work 


Based on the above analysis, we expect to build a proto- 
type caching system incorporating a prediction mecha- 
nism which, by observing user behavior, will calculate 
the current working set, detect attention shifts, and 
predict possible future working sets. A cache manager 
will then ensure that these working sets are available 
on the portable computer when it is disconnected from 
the network. 

A cache miss during disconnection is a serious, often 
catastrophic event for a user who cannot continue to 
work in the absence of a critical file. There are only 
two real options for dealing with this case: 


1. Provide enough alternate working sets that the 
user can shift to a secondary or tertiary task [6, 14]. 


2. Provide a foreground or background method that 
initiates communication (most likely expensive 
and slow) to retrieve the missing file [5]. 


We plan to provide both of these options in our pro- 
totype, though we hope to rely primarily on the first. 


6 Conclusions 


The data gathered and analysis performed in this study 
strongly indicate that predictive file caching for mo- 
bile computing is a feasible approach. However, the 
data also indicates that simple LRU caching is insuffi- 
cient. Therefore, we conclude that more sophisticated 
automatic predictive file caching mechanisms will be 
required to make the file system of a mobile computer 
appear transparently the same as the file system of a 
desktop machine. We intend to investigate suitable al- 
gorithms for this purpose, guided by these results and 
by further analysis of our data. 
Abstract 


This paper describes the architecture and implementa- 
tion of a mobile IP system. It allows mobile hosts to 
roam between cells implemented with 2-Mbps radio 
base stations, while maintaining Internet connectivity. 
The system is being developed as part of a course on 
wireless networks at Harvard and has been opera- 
tional since March 1994. 


The architecture scales well, both geographi- 
cally and in the number of mobile hosts supported. It 
supports secure short-cut routing to mobile hosts 
using the existing Internet routing system without 
change. The implementation demonstrates a robust, 
low complexity realization of the architecture, and 
provides trade-off opportunities between efficiency 
and cost. 


Measured performance of the mobile system is 
generally excellent. The system can handle a high rate 
of location updates, and routes packets almost as effi- 
ciently for mobile hosts as the Internet does for sta- 
tionary hosts. We observe reasonable TCP behavior 
during hand-offs. 


1. Introduction 


Portable computers, while quite sophisticated in many 
ways, are hampered by the lack of support for mobil- 
ity in current network protocols. The most immediate 
problem, the physical-layer link between computer 
and network, can be solved with radio. We can build 
on a long history of work in this area, such as Aloha 
[Ab 70], and more recently commercial radio hard- 
ware such as Altair [BuOdTaWh 91] and WaveLAN 
[Tu 88]. These radio systems provide limited geo- 
graphical coverage. The cellular telephone system 
[Ma 79] solves this problem by tiling the world with 
radio base stations connected by a wired network. Our 
overall goal is to adapt this idea to computer net- 
works. 


The system we describe is the product of a 
graduate and undergraduate course on wireless net- 
works taught at Harvard University in the 1993-94 
academic year. The design was completed in the fall 
of 1993. The system has been operational since 
March of 1994. Our experimental environment 
includes IBM-compatible PCs running UNIX and 2- 
Mbps WaveLAN spread-spectrum radio interfaces. 


The next section describes goals of our system. 
Section 3 compares our system with other similar 
work. Section 4 presents the basic architecture of our 
system, Section 5 explains enhancement for short-cut 
routing, and Section 6 gives more detail about the 
architecture. Section 7 analyzes the security and scal- 
ability of the system. Section 8 discusses our experi- 
mental implementation and Section 9 summarizes 
measured performance of the system. Section 10 sug- 
gests areas for future work. The final section gives 
some concluding remarks. 


2. System Goals 


Our primary goal is that our system be transparent to 
users as they roam from cell to cell. A move to 
another office, building, or city should not affect how 
a user can use network services. The user should not 
be required to take any special action because of such 
a move. All the user’s existing network connections 
should be preserved, and there should be no differ- 
ence in the way new connections are created. 


Performance should approach that delivered 
by non-mobile protocols over the same hardware. In 
particular, short-cut routing should be supported. A 
mobile IP system should not compromise the security 
of communication between existing wired hosts at all, 
and should provide the maximum practical security 
for mobile hosts. Packet redirection mechanisms pro- 
vided for the mobile system should not be manipula- 
ble by users to deliberately cause misdelivery of 
packets. 


We also aim at some practical goals less visi- 
ble to the user. Our system should not limit the num- 
ber of active mobile hosts. No administrative domain 
should need to know about mobile hosts from other 
domains, and mobile hosts should be able to roam to 
other domains just as they roam within a domain. We 
do not require changes in IP routers or non-mobile 
hosts, although changes to the latter are supported to 
increase efficiency. 


Some economic and social concerns are out- 
side the scope of our work. We assume that different 
organizations are willing to provide base station ser- 
vice to each others’ mobile hosts. 


3. Background and Previous Work 


A number of mobile IP systems have been imple- 
mented or proposed. All share a notion of mobile 
hosts (MHs), each of which keeps a constant IP 
address regardless of location (see Figure 1). All share 
the idea of radio-equipped base stations (called For- 
eign Agents, or FAs), which serve as temporary points 
of attachment to the Internet for roaming MHs. All 
use existing Internet routing protocols to direct pack- 
ets addressed to an MH to a stationary computer (a 
Home Agent, or HA) capable of forwarding them to 
the FA to which the MH is currently attached. The 
fundamental differences among these systems lie in 
these areas: 


(1) How does an HA know where an MH is? 


(2) How can ordinary hosts send directly to an 
MH’s current FA, avoiding the wasteful trip 
through the HA? 


(3) How do the mechanisms in (1) and (2) react to 
MH movement? 


Security, scalability, and compatibility drive 
the choices in these three areas. A mobile IP system 
should not be easily tricked into redirecting packets to 
malicious eavesdroppers. The MH location database 
should not become a bottleneck as the number of 
MHs grows, and thus must be distributed, perhaps at 
the cost of some complexity to ensure consistency. 
Finally, mobile hosts should be able to talk to hosts 
that know nothing about mobility. We call hosts that 
send packets to an MH Correspondent Hosts (CHs). 
They may be ordinary and send packets to an MH on 
a dog-leg route through its HA, or enhanced to use 
short-cut routes direct to an MH’s FA. 


Below we compare some other mobile IP sys- 
tems to our work. We have adopted the terminology 
of the IETF Mobile IP Working Group [MoIP 93], 
though these names (MH, FA, HA, and CH) are not 
universally used, nor do they correspond exactly to 
entities in all the systems we mention. A comprehen- 
sive comparison of several of the systems is available 
elsewhere [MySk 93]. 


3.1. Columbia’s System 


The central theme of Columbia’s system is the 
notion of a single virtual subnet to which all MHs 
belong [IoDuMaDe 92] [IoMa 93]. Each MH uses a 
radio to talk to the nearest Mobile Support Router 
(MSR), each of which has both a radio and a wired 
Internet connection. Each MSR tells the IP routing 
system that it has an interface onto the virtual subnet, 
so that normal IP routers will send packets for an MH 
to the nearest MSR. 


The system operates as follows. An MH regis- 
ters with whatever MSR happens to be in radio range, 
and periodically reconfirms this registration. This par- 
ticular MSR thus knows where the MH is. When a 
CH first sends a packet to the MH, the packet is for- 
warded to the nearest MSR by normal IP routing. If 
the MH is registered with that MSR, the MSR can 
deliver the packet to the MH directly. If not, then the 
MSR must find the MH. It sends a query to all the 
other MSRs requesting the location of the MH, and 
forwards the packet to whichever MSR responds. It 
caches the MH’s location to avoid further broadcast 
queries. 


When the MH moves to a new MSR, it 
informs the previous MSR of its new location. The 
previous MSR will cache this information and for- 
ward any packets for the MH to its new location. If 
the previous MSR receives a packet forwarded by 
another MSR, it sends that MSR a redirect specifying 
the MH’s new location. This redirect updates that 
MSR’s cached location for the MH. 


The Columbia system’s strong points are that 
it sends packets by efficient routes, even from com- 
puters that are not aware of mobile hosts, and that it 
has no unnecessary points of failure. It does not scale 
well, because MSRs broadcast to each other. It does 
have a mode of operation with improved scaling, at 
the cost of inefficient routing. It has no authentication, 
and would be vulnerable to malicious location mes- 
sages. 


3.2. Sony’s System 


Sony’s system [TeUVe 93] [TeTo 93] allows both CHs 
and intermediate routers to cache MH locations. 
Every MH has a permanent Virtual IP (VIP) and a 
Temporary IP (TIP) address. Using the normal IP 
routing system, Sony’s scheme arranges that a packet 
addressed to the VIP will end up at the MH’s HA, and 
that a packet addressed to the TIP will end up at the 
MH’s current location. 


A mobile host is allocated a TIP each time it 
moves to a new location; the TIP is an address on a 
radio LAN at that location. The MH keeps its HA 
informed of its TIP. When the HA receives packets 
addressed to the MH’s VIP, it forwards them to the 
MH’s TIP. 


When the MH sends a packet to a CH, it 
includes its current TIP in a special IP option [Br 89]. 
An enhanced CH is able to remember this TIP, and 
use it instead of the VIP for further communication 
with the MH. Packets sent to the TIP use a direct 
route to the MH through the Internet, avoiding the 
dog-leg route through the HA. Ordinary CHs ignore 
the option, and continue routing through the HA. 
When the MH moves and acquires a new TIP, it is not 
clear how it should notify an enhanced CH. Such a 
CH might continue sending to the old TIP until the 
MH sends it a packet containing the new TIP. 


The Sony system includes routers which cache 
MHs’ TIPs, and redirect packets sent by ordinary CHs 
to avoid the dog-leg through the HA. It is not clear 
how these caches are updated when an MH moves, 
especially in a network that includes ordinary routers. 


The strengths of the Sony system are that it 
scales well and can provide efficient routing for ordi- 
nary CHs. However, its specification seems incom- 
plete, and it provides no authentication for location 
updates. 


3.3. IBM’s System 


An MH in IBM’s system [RePe 92] [BhPe 93] has a 
permanent IP address. Each MH has an HA, and the 
HA tells the IP routing system that it is the gateway 
for its MHs. Thus when a CH sends a packet to the 
MH, it ends up at the HA, which will forward it to the 
MH. When an MH moves to a new location, it finds a 
nearby FA, and sends the FA’s address to the MH’s 
HA. The HA tells the MH’s previous FA to forget 
about the MH. 


When an MH sends a packet to a CH, it 
includes an IP Loose Source Route option [Br 89]. 
This option records the address of the MH’s FA. The 
CH caches the FA address, and sends any further 
packets for the MH via that FA. If the MH moves, its 
old FA will forward packets from the CH to the MH’s 
HA. Any reply from the MH will carry the MH’s new 
location, allowing the CH to update its location cache. 


If all Internet hosts implemented Loose Source 
Route correctly, IBM’s system would provide effi- 
cient routing with no changes to either CHs or routers. 
Sadly, a dearth of correct Loose Source Route imple- 
mentations thwarts this elegant system. Few systems 
actually remember and use the latest source route for 
TCP, and possibly none do so for UDP; see [MySk 93]. 
Source routes are not authenticated, so if implemented 
correctly they could be used to redirect packets 
arbitrarily. 


3.4. Matsushita’s System 


Matsushita’s mobile IP system [WaMa 93] 
[WaYoOhTa 93] is similar to the IBM scheme except 
in the way it provides efficient routing from CHs. 
When an MH moves, it acquires a temporary IP 
address. The MH then tries to find an FA (called a 
Packet Forwarding Server or PFS), and registers the 
FA’s address with its HA (which is called the “home” 
PFS). The HA also receives and forwards packets sent 
to the MH’s home address. When the HA forwards 
packets to the mobile host, it notifies the sending CH 
of the MH’s current location, so the CH can then send 
directly to the MH. 


When an MH registers a new location with the 
HA, the HA sends a packet to the old FA to de-regis- 
ter the MH and tell the old FA the mobile host’s new 
location. If a packet for the MH arrives at the old FA, 
it forwards it to the MH’s new location. The old FA 
will also inform the sending CH of the MH’s new 
location. After a time-out period the old FA discards 
the new MH location, and returns MH-bound packets 
to the MH’s HA. 


The Matsushita system appears similar to our 
system: both include MHs, FAs, and HAs, registration 
at home, and support for efficient routing. However, 
Matsushita’s Mobile IP system design does not 
directly address security issues or failure modes. For 
example, it is not clear how to perform authentication 
in this system, and the authors suggest repairing HA 
crashes by manually querying MHs for their loca- 
tions. Forwarding from old PFSs to new PFSs compli- 
cates their implementation and allows forwarding 
loops, which must be handled specially. 


3.5. Mobile IP Working Group’s Proposal 


This draft proposal [MoIP 93] also differs from the 
IBM scheme mostly in the way it provides efficient 
routing from CHs. Ordinary CHs always send packets 
to an MH via its HA. The HA, however, notices when 
a CH sends a packet to an MH, and notifies the CH 
that the MH is mobile. An enhanced CH then asks the 
HA for the MH’s current FA, and sends further pack- 
ets directly through the FA. To authenticate the HA’s 
reply to the CH, the CH sends a random number to the 
HA, and the HA must supply the same number along 
with the MH’s location. Only a router along the path 
between CH and HA could know this number and use 
it to forge the HA’s reply. But the CH must already 
trust all of those routers, since it is sending its packets 
through them. 


FIGURE 1. Example of mobile IP relocation, 
showing short-cut and dog-leg routes to the 
original MH location. 


The Mobile IP Working Group proposal is similar 
to our system, though they were developed indepen- 
dently and at roughly the same time. As a draft, it is 
not always complete and detailed. For instance, it is 
not clear how a CH determines a trustable address for 
an MH’s HA. 


A more recent draft from the Mobile IP Working 
Group [Si 94] contains more detail, but omits support 
for enhanced CHs. It may thus be more secure than 
the previous draft, and is certainly less efficient. We 
argue in Section 7 that enhanced CHs need not reduce 
security below that of the current Internet, and that 
therefore this omission is not necessary. 


4. Basic Architecture 


For explanatory purposes we consider the following 
scenario for our mobile IP system: a mobile host with 
IP address 128.103.53.42, geographically from Cam- 
bridge, Massachusetts, and under the administrative 
control of Harvard University, is carried to the Uni- 
versity of California at Berkeley by its owner, Alice. 
Alice powers up her mobile host in Berkeley, in a 
wireless cell with an IP subnet number of 128.32.130. 
We identify the following four entities involved with 
providing mobile IP access to Alice’s machine: 


• Mobile Host (MH): the portable machine with 
wireless network hardware carried by Alice to 
Berkeley. It retains the IP address 128.103.53.42 
regardless of its location. 


• Home Agent (HA): the router at Harvard respon- 
sible for routing packets to mobile hosts with IP 
addresses in subnet 128.103.53. It remembers the 
locations of all MHs with addresses on that subnet. 
There is a single HA for each subnet which sup- 
ports mobile hosts. 


• Foreign Agent (FA): the wireless base station at 
Berkeley that serves as the MH’s temporary 
attachment point to the Internet. The FA has both a 
radio and a wired Internet connection, and is will- 
ing to forward packets between them. An FA may 
serve more than one MH at the same time. 


• Correspondent Host (CH): any host on the Inter- 
net, mobile or non-mobile, with which an MH 
communicates. For our example, the CH in ques- 
tion is in Madison, Wisconsin, with IP address 
128.105.252.36. 


The entities listed above are the only ones our 
system modifies. In particular, it uses the existing 
Internet routing system without any change. This is 
how our system behaves when the CH is not enhanced 
to route efficiently to mobile hosts: 


• Upon arrival in Berkeley, Alice’s mobile host 
handshakes with a nearby foreign agent. The FA 
arranges to route packets for the MH out its wire- 
less interface, and the MH starts routing all its 
packets via the FA. The MH registers its location 
with its HA at Harvard, after proving its identity to 
the HA. The HA creates an entry in its routing 
table to the MH through this FA, and sets a flag 
indicating that packets for the MH should be 
encapsulated and forwarded to the FA. 


• The CH in Madison sends packets to Alice’s MH, 
at its permanent IP address. The standard Internet 
routing system routes these packets to the MH’s 
HA, on the MH’s home subnet. The HA looks for 
a route in its routing table to the MH in question, 
and finds the route through the FA, marked for 
transport by encapsulation. 


• The HA encapsulates the IP packet from the Mad- 
ison CH in another IP packet, and sends it to the 
Berkeley FA. When the FA receives this encapsu- 
lated packet, it extracts the enclosed packet, and 
routes it through its wireless interface to the MH. 


• Alice’s MH receives the packet from the FA. 


• If Alice moves out of the range of the FA’s radio, 
and into the range of another FA’s radio, her MH 
registers the new location with its HA. The HA 
then starts forwarding the MH’s packets via the 
new FA. Some packets from a CH may be for- 
warded by the HA to the old FA while Alice is in 
motion; the old FA discards them. Higher level 
protocols, such as TCP, should re-transmit such 
packets. 


While the above scheme allows normal IP 
routing for packets from Alice’s MH to Madison 
through the Berkeley FA and the rest of the Internet, it 
requires packets from the Madison CH to Alice’s MH 
to “dog leg” through Cambridge and then double back 
cross-country to Berkeley. While inefficient, this rout- 
ing method offers complete backward compatibility 
with existing Internet routers and unenhanced CH IP 
implementations. 
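The unenhanced forwarding path described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the kernel code of the implementation: packets are modeled as dictionaries, and the names `Route`, `HomeAgent`, `register`, and `forward`, as well as the HA and FA host addresses, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    next_hop: str      # address of the FA currently serving the MH
    encapsulate: bool  # flag: wrap packets for this MH in an outer packet


class HomeAgent:
    """Toy model of an HA's routing table (names are illustrative)."""

    def __init__(self, address):
        self.address = address
        self.routes = {}  # MH address -> Route

    def register(self, mh, fa):
        # Called only after the MH has proven its identity (Section 6.2).
        self.routes[mh] = Route(next_hop=fa, encapsulate=True)

    def forward(self, packet):
        route = self.routes.get(packet["dst"])
        if route is None:
            return None  # no known location; not handled in this sketch
        if route.encapsulate:
            # Outer packet is addressed to the FA; inner packet is untouched.
            return {"src": self.address, "dst": route.next_hop, "inner": packet}
        return packet


# Example scenario from the text; the HA and FA host addresses are made up.
ha = HomeAgent("128.103.53.1")
ha.register("128.103.53.42", "128.32.130.1")
outer = ha.forward({"src": "128.105.252.36", "dst": "128.103.53.42", "data": b"hi"})
```

The inner packet keeps the MH's permanent address as its destination, which is why ordinary routers between the HA and the FA need no changes: they only ever see the outer header.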


The maintenance of location information for 
MHs by their HAs, the encapsulation of packets by an 
HA, and decapsulation of packets by an FA all require 
data structure and code modifications to the IP imple- 
mentation. See Sections 6 and 8 for these and other 
details, such as crash recovery. 


5. Enhanced Architecture for Short-cut Routing 


We now present some IP enhancements made by our 
system that significantly improve routing efficiency 
from correspondent hosts to mobile hosts. Note that 
we maintain the invariant that existing Internet routers 
(those other than the foreign agent and home agent for 
a particular CH-MH path) require no software 
changes. Our goal here is to avoid the dog-leg route 
CH-HA-FA-MH (in our example scenario, the Madi- 
son-Cambridge-Berkeley path) in favor of the more 
direct CH-FA-MH (Madison to Berkeley) route for all 
but the first few packets from CH to MH. We modify 
the above behavior as follows: 


• When the CH sends its first packet to the MH via 
the HA, the HA informs the sending CH that the 
MH is mobile. A non-enhanced CH ignores this 
notification message; such a CH continues to use 
dog-leg routing as outlined previously. An 
enhanced CH, however, asks the HA to keep it 
informed of the MH’s location. 


• The HA remembers all CHs that have subscribed 
to MH location updates in this way. So long as this 
subscription is maintained, the HA informs the CH 
of the MH’s current FA each time the MH registers 
a new location. 


• The CH caches the location updates from the 
MH’s HA, installs the appropriate routes in its IP 
routing table (with the encapsulate flag on), and 
thereafter encapsulates packets bound for the MH 
directly to its current FA. 
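The CH side of the steps above can be sketched as a small location cache. This is a hedged illustration, not the actual routing-table code: the class `EnhancedCH`, its method names, and the dictionary packet model are assumptions, and authentication of updates (Section 7.1.2) is left out.

```python
class EnhancedCH:
    """Toy model of an enhanced correspondent host (names are illustrative)."""

    def __init__(self, address):
        self.address = address
        self.fa_for = {}  # MH address -> FA address learned from the HA

    def on_location_update(self, mh, fa):
        # Install (or replace) the short-cut route for this MH.
        self.fa_for[mh] = fa

    def send(self, mh, data):
        inner = {"src": self.address, "dst": mh, "data": data}
        fa = self.fa_for.get(mh)
        if fa is None:
            # No cached location: the packet takes the dog-leg via the HA.
            return inner
        # Cached location: encapsulate directly to the MH's current FA.
        return {"src": self.address, "dst": fa, "inner": inner}


ch = EnhancedCH("128.105.252.36")
first = ch.send("128.103.53.42", b"hello")           # routed to the HA
ch.on_location_update("128.103.53.42", "128.32.130.1")  # hypothetical FA address
later = ch.send("128.103.53.42", b"hello again")     # sent straight to the FA
```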


6. Architecture Details 


We divide the architecture into four protocols: hand- 
off, registration, location update, and routing and 
encapsulation. Each of these protocols involves soft- 
ware that runs on more than one host; for instance, 
hand-off involves both FAs and MHs. The interfaces 
between the protocol modules on any one host are 
simple. 


6.1. Hand-off 


Each FA periodically broadcasts a beacon packet on 
all of its radio interfaces. If an MH is not attached to 
any FA and hears a beacon, it asks the FA that sent the 
beacon if it can attach. The FA accepts if it is not 
overloaded, and sends an acknowledgment. At that 
point the FA puts a host route for the MH in its IP 
routing table pointing out the radio interface, and the 
MH installs a default route pointing to the FA. 


The MH monitors the beacons from its current 
FA; if it does not hear a beacon for a while, it scans 
for other FAs. The frequency with which FAs broad- 
cast beacons governs how soon an MH notices that it 
is out of range of its current FA, and therefore how 
long its service will be interrupted before it acquires a 
new FA. 
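The relationship between beacon frequency and service interruption can be made concrete with a sketch of the MH's beacon monitor. The class name `BeaconMonitor`, the `missed_allowed` parameter, and the use of explicit timestamps are assumptions for illustration; the text only says that the MH scans after not hearing a beacon "for a while".

```python
class BeaconMonitor:
    """Toy model of an MH tracking beacons from its current FA."""

    def __init__(self, beacon_interval, missed_allowed=3):
        # Assumed policy: give up after this many beacon periods of silence.
        self.deadline = beacon_interval * missed_allowed
        self.current_fa = None
        self.last_beacon = None

    def on_beacon(self, fa, now):
        if self.current_fa is None:
            # Unattached: a real MH would now send an attachment request.
            self.current_fa = fa
            self.last_beacon = now
        elif fa == self.current_fa:
            self.last_beacon = now
        # Beacons from other FAs are ignored while attached.

    def tick(self, now):
        """Return the current FA, or None if the MH should start scanning."""
        if self.current_fa is not None and now - self.last_beacon > self.deadline:
            self.current_fa = None
        return self.current_fa


m = BeaconMonitor(beacon_interval=1.0, missed_allowed=3)
m.on_beacon("fa1", now=0.0)
```

Under this assumed policy the MH notices a lost FA only after about `beacon_interval * missed_allowed` seconds, so halving the beacon interval roughly halves the detection delay at the cost of twice the beacon traffic.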


The MH periodically tells its FA that it still 
wants service. If the FA does not hear from the MH 
for a while, it deletes the route to the MH. If the FA 
receives an encapsulated packet for an MH for which 
it has no route, it silently discards the packet. 
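The FA's side of hand-off amounts to a table of host routes with a soft-state timeout. The following is a minimal sketch under assumptions: `ForeignAgent`, `max_hosts`, and `service_timeout` are hypothetical names, and the overload test is reduced to a simple host count.

```python
class ForeignAgent:
    """Toy model of an FA's attachment table (names are illustrative)."""

    def __init__(self, max_hosts, service_timeout):
        self.max_hosts = max_hosts
        self.service_timeout = service_timeout
        self.routes = {}  # MH address -> time the MH was last heard from

    def attach_request(self, mh, now):
        if mh not in self.routes and len(self.routes) >= self.max_hosts:
            return False  # overloaded: no acknowledgment is sent
        self.routes[mh] = now  # host route pointing out the radio interface
        return True

    def keepalive(self, mh, now):
        # The MH periodically says it still wants service.
        if mh in self.routes:
            self.routes[mh] = now

    def expire(self, now):
        stale = [m for m, t in self.routes.items() if now - t > self.service_timeout]
        for mh in stale:
            del self.routes[mh]

    def accepts(self, inner_dst):
        # Encapsulated packets for unknown MHs are silently discarded.
        return inner_dst in self.routes


fa = ForeignAgent(max_hosts=1, service_timeout=10.0)
granted = fa.attach_request("128.103.53.42", now=0.0)
refused = fa.attach_request("128.103.53.43", now=1.0)
```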


The FA provides service for an MH without 
any sort of authentication. This allows an unautho- 
rized MH to send packets into the Internet via the FA, 
but it does not allow the MH to receive packets unless 
they are specifically encapsulated and sent via the FA. 
The only way to arrange for the MH to receive pack- 
ets addressed to its IP address is by authenticated HA 
registration. 


6.2. Registration 


After the MH establishes a connection with an FA, it 
sends its HA a registration request. This request con- 
tains the MH’s IP address and its FA’s IP address. The 
HA replies with a randomly chosen challenge number. 
The MH signs the challenge along with the FA 
address using MD5 [Ri 92] and a secret key shared 
with the HA, and sends this signature back to the HA. 
Upon validating the signature, the HA updates its 
routing table for this MH and sends back an acknowl- 
edgment. If the authentication fails, the HA replies 
with a denial packet. The MH must periodically re- 
register with its HA, in case the HA reboots and thus 
forgets the locations of its MHs. 
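The challenge-response exchange can be sketched with Python's standard `hashlib` module. Several details here are assumptions rather than the protocol's actual wire format: the byte layout `key || challenge || FA address`, the 16-byte challenge, the class name `HomeAgentRegistrar`, and issuing a fresh challenge on every request.

```python
import hashlib
import hmac
import secrets
import socket


def sign(key: bytes, challenge: bytes, fa_addr: str) -> bytes:
    # Assumed layout: MD5 over the shared key, the challenge, and the FA address.
    # The key is mixed into the hash but never sent over the network.
    return hashlib.md5(key + challenge + socket.inet_aton(fa_addr)).digest()


class HomeAgentRegistrar:
    """Toy model of the HA's side of registration (names are illustrative)."""

    def __init__(self, keys):
        self.keys = keys      # MH address -> shared secret key
        self.pending = {}     # MH address -> (challenge, requested FA)
        self.location = {}    # MH address -> registered FA

    def request(self, mh, fa):
        challenge = secrets.token_bytes(16)
        self.pending[mh] = (challenge, fa)
        return challenge

    def reply(self, mh, signature):
        entry = self.pending.pop(mh, None)
        if entry is None or mh not in self.keys:
            return False
        challenge, fa = entry
        expected = sign(self.keys[mh], challenge, fa)
        if not hmac.compare_digest(expected, signature):
            return False      # the HA would send a denial packet
        self.location[mh] = fa
        return True           # the HA would send an acknowledgment


mh, fa, key = "128.103.53.42", "128.32.130.1", b"k" * 16  # hypothetical values
ha = HomeAgentRegistrar({mh: key})
challenge = ha.request(mh, fa)
old_signature = sign(key, challenge, fa)
accepted = ha.reply(mh, old_signature)
```

Because each request draws a new challenge, a recorded signature is useless for a later registration attempt, which is the replay protection the text describes.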





FIGURE 2. MH Hand-off and Registration. The FA 
periodically broadcasts beacons (0). The MH 
replies with an attachment request (1); the FA 
responds with an attachment grant (2). After 
attaching, the MH sends a registration request (3) 
to its HA. The HA replies with a challenge packet 
(4) to the MH. The MH sends a signed reply (5). If 
the reply is good, the HA sends a registration 
confirmation packet (6). An HA function call tells 
the update layer of the new MH location (7); this 
triggers location updates to subscribed CHs. 


The HA chooses a new challenge number for 
an MH each time the MH registers successfully. The 
challenge prevents replay attacks. It also functions 
like a sequence number, to help the MH and HA 
ignore all but the latest messages. It is especially 
useful when the MH changes location frequently. The 
MH could save one packet exchange with the HA by 
sending a non-repeating sequence number, rather than 
waiting for a challenge; we decided it would be too 
hard to keep the sequence numbers on the MH and 
HA consistent. 


The HA requires stable storage to hold one 
registration key for each MH it serves. Key management 
between the HA and its MHs is straightforward, 
as they are assumed to be under the same 
administrative authority. 


6.3. CH Update 


As described above in Section 5, the HA directly 
informs any CHs using dog-leg routing that the desti- 
nation MH is mobile. The HA can detect when a CH 
talks to an MH because the Internet routes the CH’s 
packets via the HA. An HA limits the rate at which it 
notifies any one CH that an MH is mobile, since unen- 
hanced CHs will never stop sending via the HA. 
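The rate limit on notifications can be sketched as a per-CH timestamp table. The name `NotificationLimiter` and the `min_interval` parameter are assumptions; the text does not give the actual limit or how it is keyed.

```python
class NotificationLimiter:
    """Toy model of the HA's per-CH notification rate limit."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # assumed: seconds between notifications
        self.last_sent = {}               # (CH, MH) -> time of last notification

    def should_notify(self, ch, mh, now):
        last = self.last_sent.get((ch, mh))
        if last is not None and now - last < self.min_interval:
            return False  # notified recently; just forward the packet
        self.last_sent[(ch, mh)] = now
        return True


limiter = NotificationLimiter(min_interval=5.0)
first = limiter.should_notify("128.105.252.36", "128.103.53.42", now=0.0)
second = limiter.should_notify("128.105.252.36", "128.103.53.42", now=1.0)
third = limiter.should_notify("128.105.252.36", "128.103.53.42", now=6.0)
```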





FIGURE 3. CH acquisition of direct route (FA acts 
only as a bridge to the MH, so it is omitted). (1) 
normally routed packet intercepted and (2) 
forwarded to MH triggers a notification message 
(3) to the CH. The CH asks the mobile host to 
name its home agent (4); after receiving the reply 
(5), it sends a subscription request (6) to the HA. 
The HA replies with a location update (7), which 
is then installed in the CH routing table. 


In a perfect world, the HA could use one mes- 
sage both to inform the CH that a host is mobile and 
to carry the forwarding address information. Unfortu- 
nately, the CH cannot trust the contents of the notifi- 
cation message without creating a redirection security 
loophole. First, it must determine the correct address 
of the HA of the MH mentioned in the notification. It 
does this by sending a query to the MH containing a 
random number; the MH replies with the random 
number and its HA’s address. The random number 
assures the CH that the response must have come 
either from the MH or from some router along the 
path between CH and MH. The CH must trust all 
routers along this path, since it sends its data through 
them. 


After the CH has discovered the MH’s HA, it 
sends a subscription request to the HA. The HA 
replies with the address of the MH’s current FA. The 
subscription request and reply are also protected by a 
random number. 


The relationship between HA and CH takes 
place under a subscription model. The HA remembers 
all of the CHs that have recently placed subscription 
requests. If the MH changes location, the HA notifies all 
subscribers of the new location. If a subscription 
lapse time (SLT) passes without receiving a subscrip- 
tion request from a particular CH, then the HA 
assumes that the CH no longer wishes to receive location 
updates. The CH must periodically resubscribe to 
the HA’s location update service in order to continue 
to receive updates. CHs determine whether a “conver- 
sation” with a particular MH is still active by check- 
ing the packet counter in the kernel routing table. 
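The subscription model is soft state: an entry survives only as long as the CH keeps renewing it. A minimal sketch follows, with the class name `SubscriptionTable` and the explicit `now` arguments as assumptions made for illustration.

```python
class SubscriptionTable:
    """Toy model of the HA's subscriber list with a subscription lapse time."""

    def __init__(self, lapse_time):
        self.lapse_time = lapse_time
        self.last_request = {}  # (MH, CH) -> time of latest subscription request

    def subscribe(self, mh, ch, now):
        self.last_request[(mh, ch)] = now

    def subscribers(self, mh, now):
        # Drop every subscription whose lapse time has passed without renewal.
        lapsed = [k for k, t in self.last_request.items() if now - t > self.lapse_time]
        for k in lapsed:
            del self.last_request[k]
        return sorted(ch for (m, ch) in self.last_request if m == mh)


table = SubscriptionTable(lapse_time=30.0)
table.subscribe("mh", "ch1", now=0.0)
table.subscribe("mh", "ch2", now=20.0)
```

When the MH registers a new FA, the HA would send a location update to each address returned by `subscribers`; a CH that has stopped talking to the MH simply stops renewing and drops out of the list.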





FIGURE 4. Location updates when the MH moves. 
(1) Function call from registration layer triggers 
(2) a new location message from the HA to all 
subscribed CHs. On receipt, a CH installs the new 
MH location into its route table (3). 


If the CH reboots, it will begin using dog-leg 
routes again. The HA still sends notification mes- 
sages, even for supposedly subscribed CHs, so the CH 
will go through the normal location update process. 


If the HA reboots, it will forget all current sub- 
scribers. CHs periodically re-subscribe to help 
recover from such reboots. A CH removes the route to 
an MH if its HA fails to reply after a time-out period. 
This prevents permanent misdirection by a fake HA 
which can respond to only a finite number of the CH’s 
subscription requests. 


The MH could implement the update protocol 
instead of the HA. We did not choose this approach 
because we expect some MH operating systems will 
not support monitoring of packets from CHs. 


6.4. Routing and Encapsulation 


Both CHs and HAs need to send packets to MHs by 
way of FAs. They cannot directly use the regular IP 
routing system, since it would send packets with an 
MH’s address to its HA. We use a simple encapsula- 
tion scheme for this, in which an IP packet for an MH 
is placed inside another packet, with a special IP pro- 
tocol number, addressed to an FA. A flag in each rout- 
ing table entry controls encapsulation. A CH 
encapsulates only packets that it originates, and that it 
knows it is sending to an MH. An HA acts as an IP 
router, receiving packets from CHs that don’t know an 
MH is mobile, encapsulating them, and forwarding 
them to the MH’s FA. 


An FA knows when it has received an encap- 
sulated packet by looking at the IP protocol number. It 
strips off the outer header, and processes the packet 
inside almost as if it had been received in the normal 
way. The difference is that the FA discards the encap- 
sulated packet if it is not addressed to an MH cur- 
rently attached to the FA. This prevents routing loops. 
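The encapsulation and the FA's discard rule can be sketched at the byte level. This is an illustration under assumptions, not the implementation's packet format: `ENCAP_PROTO = 4` (the number assigned to IPv4-in-IPv4) stands in for the unspecified "special IP protocol number", headers carry no IP options, and all function names are hypothetical.

```python
import socket
import struct

ENCAP_PROTO = 4  # assumption: the text does not name the protocol number used
HEADER_FMT = "!BBHHHBBH4s4s"  # 20-byte IPv4 header without options


def checksum(header: bytes) -> int:
    # Standard Internet checksum: one's-complement sum of 16-bit words.
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ip_header(src: str, dst: str, proto: int, payload_len: int, ttl: int = 64) -> bytes:
    fields = [0x45, 0, 20 + payload_len, 0, 0, ttl, proto, 0,
              socket.inet_aton(src), socket.inet_aton(dst)]
    fields[7] = checksum(struct.pack(HEADER_FMT, *fields))
    return struct.pack(HEADER_FMT, *fields)


def encapsulate(inner: bytes, sender: str, fa: str) -> bytes:
    # Place the whole original packet inside a new packet addressed to the FA.
    return ip_header(sender, fa, ENCAP_PROTO, len(inner)) + inner


def fa_receive(packet: bytes, attached: set):
    """Return the inner packet to deliver by radio, or None to discard."""
    if packet[9] != ENCAP_PROTO:
        return None  # not encapsulated; ordinary processing is not shown
    inner = packet[20:]  # strip the outer header (assumes no IP options)
    if len(inner) < 20:
        return None
    inner_dst = socket.inet_ntoa(inner[16:20])
    if inner_dst not in attached:
        return None  # MH not attached here: discard, which prevents loops
    return inner


inner = ip_header("128.105.252.36", "128.103.53.42", 17, 0)
outer = encapsulate(inner, "128.103.53.1", "128.32.130.1")  # hypothetical HA and FA
```

Discarding rather than re-forwarding is what rules out loops: an FA never injects an inner packet back into the Internet, where it could be routed to the HA and encapsulated again.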


When an MH moves, its old FA could forward 
packets to its new FA, rather than dropping them. This 
might eliminate a few lost packets. However, it is 
unlikely that this would eliminate all loss; for TCP, at 
least, a few dropped packets are very little better than 
many consecutive drops. In addition, it would be diffi- 
cult for the old FA to authenticate the location updates 
that the MH would have to send it. 


7. Analysis 


7.1. Security 


Although mobile hosts introduce some new security 
concerns, the fact that radio communication is easy to 
intercept, disrupt, and forge is not one of them. Much 
of the current wired Internet uses media with the same 
problems. Since these issues are not special to mobil- 
ity, we do not attempt to address them. Systems such 
as Kerberos [StNeSc 88] and Privacy-Enhanced Mail 
[Li 89] can solve some of these problems by provid- 
ing privacy and authentication between applications 
at either end of the network. Our aim is to maintain 
the Internet’s current level of security for existing 
applications, and to help prevent denial-of-service 
attacks on all applications, even those with end-to-end 
security. 


We face two attacks. One involves a fake MH trying 
to register under another MH’s address; the other is a 
fake HA sending location update messages to a CH by 
spoofing messages from a real HA. The second problem 
is particularly serious since CHs do not know 
which hosts are mobile; thus the same attack could be 
used to divert traffic from a wired host. 


Security on the wired Internet is a function of 
what hosts are along the path between sender and 
receiver. More formally, if hosts A and B are commu- 
nicating then a BE (Bad Element, a network-con- 
nected computer under the control of a malicious 
user) along the path between A and B can read all 
packets from A to B, forge packets from A to B, and 
cause packets not to be delivered to B. A host not 
along the path, however, cannot read packets from A 
to B. 


Although any host can forge a source address 
in the IP header, that is usually not sufficient to carry 
on an entire fraudulent conversation; the forger must 
usually see the replies to his messages to do any real 
harm. For instance, to send packets as part of a TCP 
session, the sender must use the right sender sequence 
number or the receiver will ignore the packets. These 
sequence numbers are allocated at connection setup 
time, using a random number generator, so that it is 
difficult for an attacker to guess a valid sequence 
number¹. Much of the security of our system is based 
on the assumption that attackers cannot see packets 
between the HA and CH for an indefinitely long 
period—such attackers would be able to intercept 
traffic directly without bothering to attack our system. 


7.1.1 Security of MH-HA Registration 


Our authentication scheme is implemented by two 
protocols. The MH-HA registration protocol authenti- 
cates the MH’s identity and location to the HA. The 
HA-CH update protocol allows a CH to verify that it 
is receiving location updates from an MH’s HA. 


The registration message from the MH to the 
HA, containing the IP address of the MH’s current 
FA, must be signed by the MH to prevent imperson- 
ation of the MH by BEs. To accomplish this, we use 
an MD5 signature to guarantee the authenticity of reg- 
istration messages. It is unlikely that a BE could cre- 
ate a false registration message that the HA would 
accept. Replay attacks are prevented by the use of a 
randomly chosen challenge, which is different for 
each registration. Although the true MH can effec- 
tively be denied service by interception of the regis- 
tration message, this is an unavoidable characteristic 
of any Internet connection. 


We chose MD5 over some other available sig- 
nature algorithms because it does not cause any 
interoperability problem with foreign hosts due to 
export restrictions. The MH and its HA must share a 
key which is added to the message when computing 
the MD5 hash, but not actually sent over the network. 
We generate the key and store it on both machines 
when the MH is first configured. 


1. In fact, many Berkeley-derived TCP implementations use an 
easy-to-predict sequence number generator, but this should be con- 
sidered broken. 


7.1.2 Security of CH Location Update 


Packet redirection in order to avoid dog-leg routing 
creates a potential security hole. A Bad Element who 
can forge location update messages from the HA to 
the CH can cause all traffic destined for an MH to be 
redirected to it or any other destination. 


We use tickets to enforce the property that 
although BEs anywhere may be able to forge packets 
from the HA to the CH, a host can only send valid 
location update messages to the CH if it can see pack- 
ets from the CH to the MH’s subnet. This general 
security strategy is prevalent in the Internet; it is how 
NFS, TCP, X11’s magic cookie system, and DNS 
achieve their security [Su 88][Ny 92][Mo 87]. No 
administration is necessary for this security system, 
and processing overhead is very small. 


The CH sends the HA a subscription request
asking to receive location updates for an MH. The
request contains a ticket consisting of some randomly
generated bytes X. Hosts not along the path between
the CH and HA will not see X. When sending location
updates, the HA includes the most recent X it
received from that CH. The CH will only accept
updates accompanied by the X it generated. X is
chosen from a range large enough (2^128 in our
implementation) that a BE is unlikely to guess a
valid X.


Thus, BEs not on the path between the HA and CH 
cannot fool the CH into redirecting packets for a 
mobile host. 
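

The CH side of the ticket check can be sketched as below. This is an illustration under stated assumptions, not our daemon code: the class and method names are hypothetical, and the 16-byte ticket length assumes a 128-bit ticket space.

```python
import hmac
import secrets

TICKET_BYTES = 16  # assumption: 128-bit tickets


class CorrespondentHost:
    """Hypothetical CH state: one outstanding ticket and one route per MH."""

    def __init__(self):
        self.tickets = {}  # MH address -> most recent ticket sent to its HA
        self.routes = {}   # MH address -> FA address currently used

    def subscribe(self, mh_addr: str) -> bytes:
        """Generate a fresh ticket to carry in the subscription request."""
        ticket = secrets.token_bytes(TICKET_BYTES)
        self.tickets[mh_addr] = ticket
        return ticket

    def handle_location_update(self, mh_addr: str, fa_addr: str, ticket: bytes) -> bool:
        """Install the route only if the update echoes the ticket we generated."""
        expected = self.tickets.get(mh_addr)
        if expected is None or not hmac.compare_digest(expected, ticket):
            return False
        self.routes[mh_addr] = fa_addr
        return True
```

Each new subscription replaces the stored ticket, so an update carrying an older ticket is rejected once the next subscription has been issued.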


Recall from Section 6.3 that the CH periodically
sends subscription requests to the HA and updates its
route to the MH based on the HA’s reply. Thus, even
if a BE on the CH-HA path is able to see the most
recent X, the BE can only fool the CH for a limited
time, until the CH’s next subscription takes effect.
To arrange for permanent misdirection of packets from
the CH to an MH, the BE would have to be able to
spoof packets on the CH-HA path continuously. In
this case, the BE would have been able to steal
ordinary data packets from the CH to the MH in the
first place, even without using our mobile IP system.


7.2. Scalability 


We may divide the knowledge that entities in our sys- 
tem have about other entities into two categories: 


• Static, administrative knowledge: MHs and their
HA are presumed to be under control of the same
administrative authority. It is assumed in our
system that MHs know their home HA and that HAs
know the MHs for which they route packets. An HA
and its MHs share keys for authentication purposes.
This information is static in nature, and is
maintained by system administrator action.


• Dynamic, online knowledge: The entities which
exchange location information consist only of the
MH, HA, and CH (while the FA passes along location
information, it does not produce or consume it).
This location information is dynamic in nature,
and is automatically maintained by our system
through registration and location update messages.


The limited number of parties who require 
knowledge in both of the above categories is a strong 
asset of our scheme; no entity in an administrative 
domain needs administrative knowledge about enti- 
ties outside its domain, and no entity needs online 
information about any entity other than those with 
which it is currently communicating. This fact makes 
our system fundamentally scalable to a large number 
of entities. 


The location update messages exchanged 
when an MH moves flow only between entities 
involved in communication with the MH. As shown 
in Section 9, the registration procedure of our system 
is reasonably fast. 


Large numbers of MHs can be accommodated 
by increasing the number of HAs and home subnets; 
this policy for network expansion is identical to that 
in practice today for wired networks. 


8. Implementation Notes 


We have a working implementation of the system 
described in this paper. 


Because our system involves changes to IP, we 
need an operating system for which we can obtain 
source code. We use Berkeley Software Design’s 
UNIX (BSDI) for IBM-PC compatible computers, 
which includes source. 


Another reason we use BSDI UNIX is that it 
supports the Berkeley Packet Filter [McJa 93], which 
can give a copy of every packet received by the sys- 
tem to a process. The HA software uses this to detect 
when a new CH starts sending packets to an MH. 


Our MD5 implementation comes from the
RSAREF library available from RSA Data Security, 
Inc. 


We use WaveLAN radio interfaces [Tu 88].
WaveLAN uses spread spectrum modulation to avoid
interference between nearby radios that do not wish to
communicate. Radios that do wish to talk must be set
to the same “code.” If multiple nearby WaveLANs are
set to the same code, they can communicate
peer-to-peer, though we do not currently use this
feature. WaveLAN has a range of a few hundred feet
and provides a bandwidth of about 2 megabits per
second. It uses the same frame and address format as
Ethernet, and uses CSMA/CA for medium access control.
The WaveLAN interfaces fit in an ISA slot in a PC. We
do not currently have a truly portable radio interface
using, e.g., PCMCIA, due to the difficulty of
obtaining UNIX drivers for such interfaces.


8.1. Kernel Changes 


Four UNIX kernel changes are needed to support the 
system. BSDI already allows two hosts with different 
IP network numbers to talk to each other over the 
same physical network; this situation arises when an 
MH talks to an FA. The only problem is with broad- 
casting beacon packets. The only universally accept- 
able IP broadcast address has all bits set. However, 
UNIX cannot determine on which network interface 
to send such a packet. We added a socket option to 
specify the routing table entry to be used when send- 
ing packets from a socket; in this case we would spec- 
ify a route pointing to the desired interface. 


We added encapsulation code to the kernel, 
controlled by a flag in each routing table entry. If the 
flag is set for the route a packet would use, the packet 
is encapsulated by adding a new IP header with a spe- 
cial protocol number. The encapsulating packet is 
addressed to the destination in the gateway field of the 
routing table entry, and is then routed in the usual 
way. A host knows it has received an encapsulated 
packet by the IP protocol number; the host strips off 
the encapsulating header, and processes the inner 
packet as if it had been received in the usual way. 
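

The encapsulation step can be illustrated in user space as follows. This is a sketch, not our kernel code: the protocol number 4 (the value used for IP-in-IP encapsulation) is an assumption, since the text above says only that a special protocol number is used, and the function names are hypothetical.

```python
import socket
import struct

ENCAP_PROTO = 4            # assumption: IP-in-IP protocol number
IP_HEADER = "!BBHHHBBH4s4s"  # 20-byte IPv4 header without options


def ip_checksum(header: bytes) -> int:
    """Standard 16-bit one's-complement checksum over an IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def encapsulate(inner: bytes, src: str, gateway: str, ttl: int = 64) -> bytes:
    """Prepend a new IP header addressed to the route's gateway field."""
    fields = [0x45, 0, 20 + len(inner), 0, 0, ttl, ENCAP_PROTO, 0,
              socket.inet_aton(src), socket.inet_aton(gateway)]
    fields[7] = ip_checksum(struct.pack(IP_HEADER, *fields))
    return struct.pack(IP_HEADER, *fields) + inner


def decapsulate(packet: bytes) -> bytes:
    """Strip the outer header if the protocol field marks encapsulation."""
    if packet[9] != ENCAP_PROTO:
        raise ValueError("not an encapsulated packet")
    header_len = (packet[0] & 0x0F) * 4
    return packet[header_len:]
```

The outer packet is an ordinary IP packet, so routers between the encapsulating host and the gateway forward it without modification.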


This encapsulation mechanism suffices in a 
CH, and in an HA after an MH has registered. How- 
ever, before an MH has been authenticated, its HA 
still needs to send it encapsulated packets. It cannot 
create a routing table entry for a potentially fake MH 
because that would divert packets away from the real 
MH. So we use the per-socket routing option 
described above during registration. 


To prevent packets to un-registered MHs from 
being forwarded by the HA using the usual IP routing 
system, we added a special network interface that dis- 
cards packets. The HA configures that interface with 
the network number used by the MHs it manages. The 
host routes installed for each registered MH override 
use of this interface. We do this in preference to giv- 
ing the HA a real radio interface for its MHs so that 
packets for un-registered MHs are not broadcast to 
anyone listening to the radio. The presence of this
special interface also causes the UNIX routing dae- 
mons to announce the MHs’ network number to the 
Internet routing system. 
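

The interplay between the discard interface and the per-MH host routes amounts to longest-prefix matching, which can be sketched as below. The table class, the interface name `discard0`, and the addresses are hypothetical and serve only to illustrate the lookup order.

```python
import ipaddress

DISCARD = "discard0"  # hypothetical name for the packet-discarding interface


class RoutingTable:
    """Toy routing table that picks the most specific matching route."""

    def __init__(self):
        self.routes = []  # list of (network, target) pairs

    def add(self, prefix: str, target: str) -> None:
        self.routes.append((ipaddress.ip_network(prefix), target))

    def delete(self, prefix: str) -> None:
        net = ipaddress.ip_network(prefix)
        self.routes = [r for r in self.routes if r[0] != net]

    def lookup(self, addr: str):
        ip = ipaddress.ip_address(addr)
        matches = [(net.prefixlen, target)
                   for net, target in self.routes if ip in net]
        if not matches:
            return None
        # A /32 host route wins over the network route to the discard interface.
        return max(matches, key=lambda m: m[0])[1]
```

For example, with the MH network routed to `discard0` and a host route installed for one registered MH, only packets for that MH are forwarded; packets for every other address on the network are dropped.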


UNIX caches a route for every socket, which it 
keeps using until the route is deleted from the routing 
table. When a CH first receives an MH’s location 
from an HA and installs a route for the MH, the routes 
cached by any sockets already connected to the MH 
are not affected. So while new sockets connected to 
the MH will use the efficient route, the first socket to 
send to an MH will continue to send via the HA. We 
have partially fixed this problem, but the UNIX IP 
code does not make a clean and complete solution 
easy. 


8.2. Software Structure 


Most of the software in our system runs as daemon 
processes, with a different type of daemon for each of 
the four entities (MH, FA, HA and CH). The daemons 
communicate across the network with UDP. 


We designed and partitioned the system to 
make it easy for a group of students to implement as 
independent modules. This has worked well in most 
cases. For instance, we require one process to run on 
the MH, which combines modules for hand-off and 
HA registration. The interaction between them is lim- 
ited to a function call made by the hand-off module to 
tell the registration module the IP address of the MH’s 
current FA. The modules at both ends of each proto- 
col, such as MH/HA registration, were implemented 
by the same group. 


In some cases this modularity works badly. 
One might want to make a single computer an HA, a 
CH, and an FA. The three modules cannot just be exe- 
cuted on the same computer. All three modify the 
routing table, and the modifications may conflict. 
Worse, the FA adds routes for MHs without any 
authentication. Usually this is not harmful, since an 
MH still has to register with its HA to receive any 
packets. But if the FA is also the MH’s HA, the MH 
will receive packets without registration because of 
the route added by the FA. We could solve this by 
tighter integration of the FA and HA modules. 


A class of a dozen students implemented this 
system in a month of programming. 


9. Measured Performance 


We have measured the performance of our mobile IP
system in three areas: TCP throughput, TCP delay
during hand-off, and registration speed. The computers
involved in our experiments were 66 MHz 80486 PCs.
The HA, FA, and CH were connected by a single
isolated Ethernet segment, with no other traffic.


Table 1 shows TCP throughput over three 
routes. The short-cut route performs significantly bet- 
ter than the dog-leg route, and approaches the perfor- 
mance observed on the radio link alone. 


Route                           TCP Throughput

Dog-leg Route                   1.1 Mbps
  CH->HA->FA->MH
Short-cut Route                 1.3 Mbps
  CH->FA->MH
Route over Radio Link Only      1.3 Mbps
  MH -> FA or FA -> MH


TABLE 1. TCP throughput comparisons. 





Table 2 depicts the impact of hand-off time on 
TCP delay. Even if an MH moves between FAs with 
overlapping radio ranges, there will be some amount 
of time during which packets sent by a CH to the MH 
will not be delivered. This includes time for the MH 
to realize it has lost contact with the old FA, for the 
MH to scan for a new FA and attach to it, for the MH 
to register with the HA, and for the HA to send a loca- 
tion update to the CH. Some packets will be lost dur- 
ing this time, and must be retransmitted after an 
additional time-out interval by higher protocol layers 
such as TCP. Previous work [CaIf 93] has suggested
that short hand-off times can result in disproportion- 
ately long interruptions in TCP traffic. Our experi- 
ments, summarized in Table 2, indicate that the 
expected interruption in service on a TCP connection 
is little more than twice the dead time. This is consis- 
tent with the fact that TCP doubles its retransmission 
time-out on each consecutive failed retransmission. 
Some of TCP’s behavior shown in the table is due to 
its minimum retransmission time-out of one second 
and timer granularity of half a second. 
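

The relationship between dead time and TCP interruption can be illustrated with a simplified back-off model. This sketch assumes that a packet is lost at the start of the dead time, that the first retransmission time-out is one second, that the time-out doubles after each failure, and that the first retransmission sent after the dead time ends succeeds. It ignores timer granularity and is not the procedure used to produce Table 2.

```python
def tcp_interruption(dead_time: float, initial_rto: float = 1.0) -> float:
    """Time from the first lost packet to the first successful retransmission.

    Retransmissions occur at initial_rto, 3*initial_rto, 7*initial_rto, ...
    because the time-out doubles after every failed attempt.
    """
    elapsed = 0.0
    rto = initial_rto
    while True:
        elapsed += rto           # wait for the current time-out to expire
        if elapsed >= dead_time:
            return elapsed       # the link is back, so this attempt gets through
        rto *= 2                 # exponential back-off after a failure
```

Under these assumptions the interruption is always less than twice the dead time plus the initial time-out, which agrees with the observation that the interruption is little more than twice the dead time.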


Hand-off Time (Seconds)         TCP Delay (Seconds)


TABLE 2. TCP delay as a function of hand-off time.


Table 3 depicts the results of some stress tests
measuring the speed of the registration process. A
registration takes no more than 20 milliseconds
elapsed time. This is equally divided between CPU
time and transmission time.


                                # Registrations per Second

MH registration to HA
  without CH subscribing
MH registration to HA
  with one CH subscribing


TABLE 3. Registration speed.


10. Future Work 


Areas that warrant further investigation include 
improving the security of location update messages, 
optimizing hand-off for special cases, and load bal- 
ancing for FAs in overlapping cells. We briefly 
explain two of these areas. 


As explained earlier, our location update mes- 
sages from the HA to the CH are vulnerable to spoof- 
ing and replay by persistent malicious hosts along the 
path between the HA and the CH. We could use digi- 
tal signatures to provide better security for these 
updates. This would require a key and certificate man- 
agement, storage, and distribution architecture to 
guarantee that CHs verify signatures with the correct 
keys. 


Hand-off in our system is not particularly fast, 
as it requires the mobile hosts to scan channels listen- 
ing for beacons from foreign agents. More coordina- 
tion among FAs and MHs might allow them to locate 
each other faster. 


11. Conclusions 


The existing IP routing system makes no provision for 
mobile hosts; it cannot react to rapid changes in net- 
work topology, and its global knowledge of topology 
cannot scale to the size required to track individual 
hosts. We have presented the architecture and imple- 
mentation of a solution to this problem. It makes use 
of IP routing and the Internet infrastructure without 
modification. It maintains a database of mobile host 
locations, partitioned in a way that allows scaling. It is 
backward-compatible with existing hosts, but gives 
the option of increasing routing efficiency by adding 
short-cut routing to host IP software. Neither the loca- 
tion database nor the host IP modifications decrease 
security below the level provided by today’s Internet. 


Our system turns out to be quite close to the
overall direction outlined in a recent draft [MoIP 93]
of the IETF Mobile Working Group. It appears that
we have one of the first working implementations of
the architectural approach being pursued by the
Group. Our implementation demonstrates the
practicality of the approach, including secure
short-cut routing.